SQL Interview Ultimate Guide: Concepts, Questions & Advanced Topics

1. Introduction to SQL Interviews

Purpose of SQL Interviews

Ever wondered why companies are so keen on your SQL skills? It's more than just knowing a few commands!

The Big Picture: SQL interviews primarily assess your analytical thinking and problem-solving abilities, not just your memorization of syntax.

There's a constant trade-off in the real world between writing a query quickly and ensuring it performs optimally. Interviewers want to see how you navigate this spectrum.

Consider the progression of SQL proficiency:

  • Syntax Recall: Can you write basic SELECT, FROM, WHERE?
  • Logical Problem Solving: Can you translate business questions into query logic using joins, aggregations, subqueries?
  • Query Optimization: Can you write efficient queries that scale, considering indexes, execution plans, and data volume?

Most interviews target the Logical Problem Solving stage, often hinting at the need for Query Optimization.

Types of SQL Questions Asked

SQL interviews aren't one-size-fits-all. You might encounter a variety of question formats:

📝 Definition-based (Theory)

Questions about core concepts like ACID properties, Normalization Forms, types of keys, or the difference between `DELETE` and `TRUNCATE`.

💻 Query Writing (Whiteboard/Live coding)

The most common type. You'll be given a schema and a business problem, then asked to write a SQL query to solve it.

📐 Database Design (Schema/Modeling)

You might need to design a simple database schema given a set of requirements, identifying tables, columns, and relationships.

🔍 Debugging/Optimization

Presented with an existing (often inefficient or incorrect) query, you'll be asked to fix it, optimize it, or explain its flaws.

Strategies for Answering Effectively

Beyond knowing SQL, how you approach and articulate your solution can make a huge difference.

  • Clarify Requirements: Before writing any code, ask clarifying questions. What are the expected inputs and outputs? Are there any edge cases?
  • State Assumptions: Explicitly mention any assumptions you're making about the data (e.g., "I'm assuming user_ids are unique and non-NULL in the Users table."). This shows thoughtful consideration.
  • Formatting and Readability: Write clean, well-formatted SQL. Use consistent capitalization, proper indentation, and comments where necessary. A readable query is easier to debug and understand.
  • Think Out Loud: Walk the interviewer through your thought process. Explain *why* you're choosing a `LEFT JOIN` over an `INNER JOIN`, or *why* you're using a CTE.

2. SQL Fundamentals, Prerequisites & Definitions

Relational Model Foundations

At the heart of SQL lies the Relational Model, a way of organizing data into structured tables. Understanding these foundational terms is crucial.

Key Insight: SQL databases are built on the mathematical concept of sets and relations, ensuring data integrity and consistency.

Let's break down the core components:

Anatomy of a Relational Table

user_id (PK) | username      | email               | registration_date
101          | alice_smith   | alice@example.com   | 2023-01-15
102          | bob_jones     | bob@example.com     | 2023-01-18
103          | charlie_brown | charlie@example.com | 2023-02-01
  • Database: The overarching container for all your schemas, tables, views, etc. (e.g., "MyCompanyDB").
  • Schema: A logical collection of database objects, often used to group related tables or define access permissions. (e.g., "public", "sales_data").
  • Table (Relation): The primary structure where data is stored, consisting of rows and columns. The entire structure above is a table.
  • Row (Tuple/Record): A single entry or instance of data in a table. In the table above, (101, alice_smith, alice@example.com, 2023-01-15) is one row.
  • Column (Attribute/Field): A specific type of data that each row contains. username, email, registration_date are columns.
  • Sets: In the relational model, tables are often considered "sets" of rows (tuples). Each row is unique, and the order of rows doesn't inherently matter.

Data Types & Nuances

Choosing the right data type for each column is crucial for data integrity, storage efficiency, and query performance.

  • 🔢 Numeric:
    • INT: Whole numbers (e.g., 1, 1000).
    • BIGINT: Larger whole numbers.
    • FLOAT/REAL: Approximate decimal numbers (less precise, good for scientific data).
    • DECIMAL/NUMERIC/MONEY: Exact decimal numbers (crucial for financial data, precise calculations).
  • 📝 String:
    • CHAR(n): Fixed-length string. If you store 'hi' in `CHAR(10)`, it still takes 10 bytes (padded with spaces). Faster for fixed-size data.
    • VARCHAR(n): Variable-length string. Stores only what's needed (plus a small overhead). More common.
    • TEXT: Very large variable-length strings, no practical length limit in many systems.
  • Date/Time:
    • DATE: Stores only the date (YYYY-MM-DD).
    • TIME: Stores only the time (HH:MM:SS).
    • TIMESTAMP: Stores date and time.
    • TIMESTAMPTZ (PostgreSQL)/DATETIMEOFFSET (SQL Server): Stores date and time with timezone awareness. Crucial for applications spanning multiple timezones.
  • Boolean:
    • BOOLEAN (PostgreSQL)/BIT (SQL Server)/TINYINT(1) (MySQL): Stores true/false values.
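As a quick, hedged illustration of how these choices map onto a concrete table, here is a minimal sketch (PostgreSQL-flavored syntax; the PaymentEvents table and its columns are invented for this example):

CREATE TABLE PaymentEvents (
    event_id      BIGINT,          -- large whole numbers for high-volume IDs
    amount        DECIMAL(12, 2),  -- exact decimal: safe for money
    discount_rate FLOAT,           -- approximate decimal: fine for non-financial ratios
    country_code  CHAR(2),         -- fixed-length code, e.g. 'US'
    event_name    VARCHAR(100),    -- variable-length text
    customer_note TEXT,            -- long free-form text
    occurred_at   TIMESTAMPTZ,     -- date and time with timezone awareness
    is_refunded   BOOLEAN          -- true/false flag
);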

The Concept of NULL

NULL is one of the most misunderstood concepts in SQL. It does not mean zero, an empty string, or false.

  • Definition: NULL represents missing or unknown data.
  • Missing vs. Empty vs. Zero:
    • NULL: Data is unknown or not applicable.
    • '' (empty string): Data is known, but it's an empty text value.
    • 0 (zero): Data is known, and its numerical value is zero.

The presence of NULL introduces "Three-Valued Logic" into SQL: TRUE, FALSE, and UNKNOWN. This has significant implications for conditions in `WHERE` and `HAVING` clauses.

Truth Table: AND/OR Interactions with NULL

Condition A | Condition B | Result (A AND B) | Result (A OR B)
TRUE        | TRUE        | TRUE             | TRUE
TRUE        | FALSE       | FALSE            | TRUE
TRUE        | NULL        | UNKNOWN          | TRUE
FALSE       | FALSE       | FALSE            | FALSE
FALSE       | NULL        | FALSE            | UNKNOWN
NULL        | TRUE        | UNKNOWN          | TRUE
NULL        | NULL        | UNKNOWN          | UNKNOWN
Watch Out! A common mistake is using = NULL or != NULL. These comparisons always result in UNKNOWN. You must use IS NULL or IS NOT NULL to correctly filter for or against NULL values.
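To see three-valued logic in action, here is a small sketch against the Employees table used throughout this guide (assuming some rows have a NULL phone_number):

-- Returns ZERO rows: 'phone_number = NULL' evaluates to UNKNOWN for every row,
-- and WHERE only keeps rows whose condition is TRUE.
SELECT * FROM Employees WHERE phone_number = NULL;

-- Returns the rows where the phone number is actually missing.
SELECT * FROM Employees WHERE phone_number IS NULL;

-- UNKNOWN also silently drops rows from "negative" filters:
-- an employee with a NULL department_id is NOT returned here,
-- because (NULL <> 3) is UNKNOWN, not TRUE.
SELECT * FROM Employees WHERE department_id <> 3;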

Keys & Relationships

Keys define relationships between tables and enforce data integrity.

🔑 Primary Key (PK)

  • Uniquely identifies each row in a table.
  • Must contain unique values and cannot be NULL.
  • Natural Key: A column that naturally exists and is unique (e.g., `ISBN` for a book).
  • Surrogate Key: An artificial key, often an auto-incrementing integer, created specifically to be the PK (e.g., `user_id`). Recommended for flexibility.

🔗 Foreign Key (FK)

  • A column (or set of columns) in one table that refers to the Primary Key of another table.
  • Establishes and enforces a link between the data in two tables.
  • Ensures referential integrity: you can't have an order for a customer that doesn't exist.

🧩 Composite Key

  • A Primary Key consisting of two or more columns whose values, when combined, uniquely identify each row in a table.
  • Used when no single column can guarantee uniqueness.

Conceptual ER Diagram: Customers & Orders

Customers Table

customer_id (PK, INT) 🔑

first_name (VARCHAR)

last_name (VARCHAR)

email (VARCHAR, UNIQUE)

Orders Table

order_id (PK, INT) 🔑

customer_id (FK, INT) 🔗

order_date (DATE)

total_amount (DECIMAL)

In this conceptual diagram, the customer_id in the Orders table is a Foreign Key that links back to the customer_id Primary Key in the Customers table. This ensures every order belongs to an existing customer.
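One way to express this diagram in DDL is sketched below (generic syntax; the OrderItems table is an invented addition purely to illustrate a composite key):

CREATE TABLE Customers (
    customer_id INT PRIMARY KEY,              -- surrogate key
    first_name  VARCHAR(50),
    last_name   VARCHAR(50),
    email       VARCHAR(100) UNIQUE
);

CREATE TABLE Orders (
    order_id     INT PRIMARY KEY,
    customer_id  INT NOT NULL,
    order_date   DATE,
    total_amount DECIMAL(10, 2),
    -- Referential integrity: every order must point to an existing customer
    FOREIGN KEY (customer_id) REFERENCES Customers(customer_id)
);

-- Composite key example: one row per (order_id, product_id) combination
CREATE TABLE OrderItems (
    order_id   INT,
    product_id INT,
    quantity   INT,
    PRIMARY KEY (order_id, product_id)
);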

SQL Command Categories (The Big Picture)

SQL commands are broadly categorized by their purpose:

⚙️ DDL (Data Definition Language)

Commands that define or manage the database structure.

  • CREATE (DATABASE, TABLE, INDEX, VIEW)
  • ALTER (TABLE, INDEX, VIEW)
  • DROP (DATABASE, TABLE, INDEX, VIEW)
  • TRUNCATE (TABLE - resets data, keeps structure)
  • RENAME

💾 DML (Data Manipulation Language)

Commands that manipulate the data within the database objects.

  • SELECT (Retrieve data)
  • INSERT (Add new data)
  • UPDATE (Modify existing data)
  • DELETE (Remove existing data)

🔒 DCL (Data Control Language)

Commands that manage permissions and access control to the database.

  • GRANT (Give user permissions)
  • REVOKE (Remove user permissions)

🔄 TCL (Transaction Control Language)

Commands that manage transactions in the database.

  • COMMIT (Save changes permanently)
  • ROLLBACK (Undo changes)
  • SAVEPOINT (Set a point within a transaction to roll back to)
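A brief, hedged sketch of DCL and TCL in practice (reporting_user, the Accounts table, and the savepoint name are placeholders; transaction syntax varies slightly by RDBMS):

-- DCL: grant, then revoke, read access for a reporting role
GRANT SELECT ON Employees TO reporting_user;
REVOKE SELECT ON Employees FROM reporting_user;

-- TCL: group DML changes into one atomic unit of work
BEGIN;                                  -- or START TRANSACTION / BEGIN TRANSACTION
UPDATE Accounts SET balance = balance - 100 WHERE account_id = 1;
SAVEPOINT after_debit;                  -- a point we can roll back to
UPDATE Accounts SET balance = balance + 100 WHERE account_id = 2;
ROLLBACK TO SAVEPOINT after_debit;      -- undo only the second update if needed
COMMIT;                                 -- make the surviving changes permanent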

3. Deep Dive: Order of Execution (Crucial Concept)

Understanding the order in which a SQL query is processed by the database engine is one of the most fundamental and often overlooked concepts. It's the key to writing correct, efficient, and debuggable queries.

Why it matters: The order you write a SQL query is different from the order the database executes it. This distinction explains many common errors and dictates how you structure your logic.

Syntactical Order (How you write it)

This is the sequence in which you typically type out your SQL query. It's designed for human readability and clarity.

SELECT
FROM/JOIN
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT

Logical Order (How the engine runs it)

This is the internal sequence the database follows to process your query. Each step produces an intermediate result set that is passed to the next step. Understanding this is vital for predicting query behavior.

1. FROM/JOIN
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT
6. ORDER BY
7. LIMIT
Crucial Implication: Why aliases in SELECT don't work in WHERE

Notice that SELECT (step 5) occurs after WHERE (step 2). This means any column aliases defined in your SELECT list are not yet "known" to the database engine when it processes the WHERE clause. You cannot filter by an alias defined in the same query's SELECT statement within its WHERE clause.

Example of an invalid query (due to logical order):

SELECT
    first_name,
    last_name,
    (salary * 1.10) AS adjusted_salary
FROM
    employees
WHERE
    adjusted_salary > 60000; -- ERROR: 'adjusted_salary' is unknown here!

To achieve this, you would either repeat the calculation in the WHERE clause, use a subquery, or a Common Table Expression (CTE).

Visual Comparison: Written Order vs. Execution Order

📝 Written Order (Human Perspective)

SELECT
FROM / JOIN
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT

🚀 Logical Execution Order (Database Perspective)

1. FROM / JOIN
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT
6. ORDER BY
7. LIMIT

Practice & Application

🎯 Practice: Filtering by a Calculated Column (Alias Issue)

You need to find all employees whose estimated annual bonus, calculated as 10% of their salary, is greater than $5,000. Try to use an alias for the bonus calculation in your SELECT statement and then filter by it.

Click for Solution

This problem directly tests your understanding of the SQL query execution order. The WHERE clause is processed before the SELECT clause. Therefore, any alias defined in SELECT is not yet known when the WHERE clause attempts to filter by it, leading to an error.

Incorrect Attempt (will fail):

SELECT
    first_name,
    last_name,
    salary,
    (salary * 0.10) AS annual_bonus
FROM
    Employees
WHERE
    annual_bonus > 5000; -- ERROR: 'annual_bonus' is unknown here!

Correct Solution: To fix this, you must either repeat the calculation in the WHERE clause or use a subquery/CTE where the calculated column is established in an inner query and then filtered in an outer query.

-- Option 1: Repeat the calculation (simplest for single use)
SELECT
    first_name,
    last_name,
    salary,
    (salary * 0.10) AS annual_bonus
FROM
    Employees
WHERE
    (salary * 0.10) > 5000;

-- Option 2: Using a Subquery (more maintainable for complex calculations)
SELECT
    first_name,
    last_name,
    salary,
    annual_bonus
FROM
    (SELECT
        first_name,
        last_name,
        salary,
        (salary * 0.10) AS annual_bonus
    FROM
        Employees
    ) AS subquery_with_bonus
WHERE
    annual_bonus > 5000;
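A third option, hinted at above, is a Common Table Expression; here is a minimal sketch of the same logic (supported by all major modern RDBMS):

-- Option 3: Using a CTE (often the most readable for layered logic)
WITH employee_bonus AS (
    SELECT
        first_name,
        last_name,
        salary,
        (salary * 0.10) AS annual_bonus
    FROM
        Employees
)
SELECT
    first_name,
    last_name,
    salary,
    annual_bonus
FROM
    employee_bonus
WHERE
    annual_bonus > 5000;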

🎯 Practice: Filtering Rows vs. Filtering Groups

You need to find the average salary for each department, but only for employees hired after January 1, 2020. Additionally, you only want to see departments where this calculated average salary is greater than $65,000.

Click for Solution

This problem highlights the difference between WHERE and HAVING. WHERE filters individual rows before any grouping or aggregation takes place. HAVING filters entire groups after aggregation.

SELECT
    department,
    AVG(salary) AS avg_dept_salary
FROM
    Employees
WHERE
    hire_date > '2020-01-01' -- Filters individual rows BEFORE grouping
GROUP BY
    department
HAVING
    AVG(salary) > 65000; -- Filters groups AFTER aggregation

Explanation:

  • The FROM clause gathers all employee data.
  • The WHERE clause removes any employee records hired on or before '2020-01-01'.
  • The GROUP BY clause then aggregates the remaining employees by department.
  • The HAVING clause finally filters these aggregated groups, keeping only those departments whose average salary (of employees hired after 2020) exceeds $65,000.
  • Finally, SELECT projects the results.

🎯 Practice: Ordering by a Derived Column Alias

List all employees, showing their full name (first_name concatenated with last_name) and their salary. Order the results alphabetically by the full name.

Click for Solution

Unlike the WHERE clause, the ORDER BY clause is processed after the SELECT clause. This means any column aliases defined in your SELECT list are fully accessible and usable within the ORDER BY clause.

SELECT
    first_name,
    last_name,
    CONCAT(first_name, ' ', last_name) AS full_name,
    salary
FROM
    Employees
ORDER BY
    full_name ASC; -- 'full_name' alias is recognized here!

Explanation:

  • The FROM clause retrieves all employee data.
  • The SELECT clause calculates and aliases the full_name.
  • The ORDER BY clause then takes this result set and sorts it alphabetically using the newly created full_name column.

🎯 Practice: Retrieving Top N Ordered Results

Find the 3 most recently hired employees. Include their full name, department, salary, and hire date.

Click for Solution

This problem demonstrates the typical sequence of `ORDER BY` followed by `LIMIT` (or `TOP` in SQL Server). The data is first sorted according to the criteria, and then the specified number of rows is taken from the top of the sorted list.

SELECT
    CONCAT(first_name, ' ', last_name) AS full_name,
    department,
    salary,
    hire_date
FROM
    Employees
ORDER BY
    hire_date DESC -- Sort by hire date in descending order (most recent first)
LIMIT 3; -- Get only the top 3 rows from the sorted result set

Explanation:

  • The FROM and SELECT clauses prepare the initial set of data and columns.
  • The ORDER BY hire_date DESC clause arranges all employees from most recently hired to oldest hired.
  • Finally, the LIMIT 3 clause takes only the first three rows from this now-sorted list, giving you the 3 most recently hired employees.

4. Data Definition Language (DDL)

Data Definition Language (DDL) commands are used to define, modify, and manage the structure of database objects like databases, tables, indexes, and views. They deal with the schema rather than the data itself.

Key takeaway: In many RDBMS (e.g., MySQL, Oracle), DDL commands cause an implicit commit, meaning they cannot be rolled back inside a transaction, unlike DML operations; PostgreSQL is a notable exception with transactional DDL. Use with caution!

Creating Structures

The foundation of any database interaction begins with defining its structure.

CREATE DATABASE

This command is used to create a new SQL database instance.

CREATE DATABASE MyCompanyDB;

CREATE TABLE

This is where you define the blueprint for your data, specifying column names, data types, and constraints.

Defining Constraints Inline vs. Externally:

Constraints (like Primary Keys, Foreign Keys, NOT NULL, UNIQUE, CHECK, DEFAULT) can be defined as part of the column definition (inline) or separately at the table level (external).

CREATE TABLE Employees (
    employee_id INT PRIMARY KEY,                       -- Inline PRIMARY KEY
    first_name VARCHAR(50) NOT NULL,                   -- Inline NOT NULL
    last_name VARCHAR(50) NOT NULL,
    email VARCHAR(100) UNIQUE,                         -- Inline UNIQUE
    hire_date DATE DEFAULT CURRENT_DATE,               -- Inline DEFAULT
    salary DECIMAL(10, 2) CHECK (salary > 0),          -- Inline CHECK
    department_id INT,
    -- External FOREIGN KEY definition
    CONSTRAINT fk_department
        FOREIGN KEY (department_id) REFERENCES Departments(department_id)
);

CREATE TABLE Departments (
    department_id INT PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL UNIQUE
);

Why choose one over the other?

  • Inline: Simpler for single-column constraints.
  • External: Required for multi-column (composite) constraints (e.g., `PRIMARY KEY (col1, col2)`), and often preferred for foreign keys for clarity and explicit naming.

Modifying Structures (`ALTER`)

Over time, your database schema will likely need adjustments. The `ALTER TABLE` command is your tool for this.

Adding vs. Dropping Columns

-- Add a new column to the Employees table
ALTER TABLE Employees
ADD COLUMN phone_number VARCHAR(20);

-- Drop an existing column from the Employees table
ALTER TABLE Employees
DROP COLUMN phone_number;

Changing Data Types

-- Change the data type of the email column (syntax varies by RDBMS)
-- PostgreSQL:
ALTER TABLE Employees
ALTER COLUMN email TYPE VARCHAR(150);

-- SQL Server:
ALTER TABLE Employees
ALTER COLUMN email VARCHAR(150);

-- MySQL:
ALTER TABLE Employees
MODIFY COLUMN email VARCHAR(150);
Risks of Data Loss:
  • Shrinking a column's length (e.g., VARCHAR(100) to VARCHAR(50)) can truncate existing data if values exceed the new limit.
  • Changing a data type (e.g., VARCHAR to INT) will fail if existing data cannot be converted (e.g., trying to convert 'abc' to an integer).
  • Always back up your database before performing significant `ALTER` operations on production systems.

Destructive Commands (Edge Cases)

These commands permanently remove data or structures. Use with extreme caution!

DROP (Remove structure + data)

DROP removes an entire database object (table, database, index, view). When you drop a table, its structure, indexes, permissions, and all its data are permanently deleted.

DROP TABLE OldDataArchive;
DROP DATABASE TestDatabase;

TRUNCATE (Reset data, keep structure, fast log)

TRUNCATE TABLE removes all rows from a table, effectively emptying it. Unlike `DELETE`, it does so by deallocating the data pages, making it much faster and using fewer transaction log resources, especially for large tables. It maintains the table structure, indexes, and constraints.

TRUNCATE TABLE TempLogTable;

Comparison: DELETE vs. TRUNCATE vs. DROP

Understanding the nuances of these destructive commands is vital for database management and preventing accidental data loss.

Feature                    | DELETE                                        | TRUNCATE                              | DROP
What it removes            | Rows (based on WHERE clause)                  | All rows (data only)                  | Table definition + all data
DML or DDL?                | DML                                           | DDL                                   | DDL
Can it be rolled back?     | Yes (if in a transaction)                     | No (implicitly commits in many RDBMS) | No (implicitly commits)
Speed for large tables     | Slower (row-by-row logging)                   | Faster (deallocates data pages)       | Fast
Uses WHERE clause?         | Yes                                           | No                                    | No
Resets AUTO_INCREMENT?     | No (unless all rows deleted, in some systems) | Yes (in most RDBMS)                   | N/A (table is removed)
Fires triggers?            | Yes (row-level triggers)                      | No                                    | No
Retains structure/indexes? | Yes                                           | Yes                                   | No

Practice & Application

🎯 Practice: Creating Tables with Mixed Constraints

You need to create two tables for an e-commerce platform: Categories and Products. The Categories table should have category_id (Primary Key, integer) and category_name (unique, not null, varchar(100)). The Products table should have:

  • product_id (Primary Key, integer, auto-incrementing if possible)
  • product_name (not null, unique, varchar(255))
  • price (decimal, must be greater than 0)
  • stock_quantity (integer, default 0)
  • category_id (Foreign Key referencing Categories.category_id)

Click for Solution

This problem demonstrates how to use CREATE TABLE and define various types of constraints, both inline and externally, as discussed in the DDL section.

-- First, create the Categories table as Products will reference it
CREATE TABLE Categories (
    category_id INT PRIMARY KEY,
    category_name VARCHAR(100) NOT NULL UNIQUE
);

-- Now, create the Products table
CREATE TABLE Products (
    product_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- PostgreSQL syntax for auto-increment PK
    -- (For MySQL, use INT AUTO_INCREMENT PRIMARY KEY)
    -- (For SQL Server, use INT IDENTITY(1,1) PRIMARY KEY)
    product_name VARCHAR(255) NOT NULL UNIQUE,
    price DECIMAL(10, 2) NOT NULL CHECK (price > 0), -- Inline CHECK constraint
    stock_quantity INT DEFAULT 0, -- Inline DEFAULT constraint
    category_id INT,
    -- External Foreign Key constraint for clarity
    CONSTRAINT fk_category
        FOREIGN KEY (category_id) REFERENCES Categories(category_id)
);

Explanation:

  • PRIMARY KEY is defined inline for category_id and product_id.
  • NOT NULL and UNIQUE are also inline for category_name and product_name.
  • CHECK (price > 0) ensures that no product can have a non-positive price.
  • DEFAULT 0 sets the initial stock quantity if not specified.
  • The FOREIGN KEY constraint for category_id in the Products table is defined externally for better readability and to give it a specific name (`fk_category`), linking it to the category_id in the Categories table.

🎯 Practice: Modifying Table Structures with ALTER TABLE

You have the Products table created previously. Now, a new requirement comes in:

  1. Add a new column called description of type TEXT (which can be NULL) to the Products table.
  2. It's realized that product names might be longer than 255 characters, so you need to increase the length of the product_name column to VARCHAR(500).

Click for Solution

This exercise uses ALTER TABLE to modify an existing table's structure by adding a column and changing an existing column's data type, highlighting potential risks.

-- 1. Add the 'description' column
ALTER TABLE Products
ADD COLUMN description TEXT;

-- 2. Increase the length of 'product_name'
-- Syntax varies slightly by RDBMS:

-- PostgreSQL:
ALTER TABLE Products
ALTER COLUMN product_name TYPE VARCHAR(500);

-- SQL Server:
ALTER TABLE Products
ALTER COLUMN product_name VARCHAR(500);

-- MySQL:
ALTER TABLE Products
MODIFY COLUMN product_name VARCHAR(500);

Explanation:

  • ALTER TABLE Products ADD COLUMN description TEXT; simply appends a new column. Since NOT NULL isn't specified, it defaults to allowing NULLs.
  • Changing data types (like VARCHAR(255) to VARCHAR(500)) is generally safe when increasing length. However, if you were to decrease the length, you would risk truncating existing data if any product names exceeded the new, smaller limit. Always consider existing data when modifying column definitions.

🎯 Practice: Removing Columns and Adding Composite Constraints

Continuing with the Products table:

  1. The business decides that the stock_quantity column is managed by an external inventory system and is no longer needed in the database. Remove this column.
  2. To prevent duplicate products within the same category (e.g., two products named 'Milk' in the 'Dairy' category but not across categories), add a composite UNIQUE constraint on product_name and category_id.

Click for Solution

This problem uses ALTER TABLE to remove a column and add a composite constraint, reinforcing the concept of multi-column constraints.

-- 1. Drop the 'stock_quantity' column
ALTER TABLE Products
DROP COLUMN stock_quantity;

-- 2. Add a composite UNIQUE constraint
-- Note: Products.product_name already has a single-column UNIQUE constraint,
-- which would still block the same name appearing in two different categories;
-- to fully meet the requirement you would also drop that constraint.
-- The composite constraint below demonstrates the multi-column syntax.
ALTER TABLE Products
ADD CONSTRAINT UQ_ProductCategoryName UNIQUE (product_name, category_id);

Explanation:

  • DROP COLUMN stock_quantity permanently removes the column and all its data. This is irreversible unless you have a backup.
  • ADD CONSTRAINT UQ_ProductCategoryName UNIQUE (product_name, category_id) creates a unique constraint across *both* columns. This means the combination of `product_name` and `category_id` must be unique. You could have "Milk" in Category A and "Milk" in Category B, but not two "Milk" products in Category A (provided the original single-column UNIQUE constraint on product_name is dropped, as noted in the comments above).

🎯 Practice: Understanding DELETE, TRUNCATE, and DROP

Imagine you have a table called AuditLogs with millions of records. Describe the SQL commands you would use for the following scenarios, and explain why each is appropriate:

  1. Remove all audit logs specifically older than 6 months, while allowing the operation to be easily undone if a mistake is made within a transaction.
  2. You need to completely clear all data from the AuditLogs table to start fresh, but you want to keep the table structure, indexes, and constraints. This operation should be as fast as possible for a very large table.
  3. The AuditLogs feature is being completely decommissioned, and the table, along with all its data, is no longer needed in the database.

Click for Solution

This problem directly tests your understanding of the critical differences between DELETE, TRUNCATE, and DROP, especially in terms of data removal, rollback capability, and performance.

-- Scenario 1: Remove specific old records, with rollback capability.
-- Assuming 'log_date' is the timestamp column.
BEGIN TRANSACTION; -- Start a transaction (syntax might vary: START TRANSACTION;)

DELETE FROM AuditLogs
WHERE log_date < CURRENT_DATE - INTERVAL '6 MONTH'; -- PostgreSQL example (adjust for your RDBMS)
-- For SQL Server: WHERE log_date < DATEADD(month, -6, GETDATE());
-- For MySQL: WHERE log_date < DATE_SUB(CURDATE(), INTERVAL 6 MONTH);

-- If you realize you made a mistake:
-- ROLLBACK;

-- If satisfied:
-- COMMIT;

Explanation for Scenario 1 (DELETE):

  • DELETE is a DML command, meaning it operates row by row and is fully logged.
  • It allows a WHERE clause to target specific records.
  • Crucially, it respects transactions, so if wrapped in a BEGIN TRANSACTION and COMMIT/ROLLBACK, the changes can be undone if an error occurs or the decision needs to be reversed. This makes it ideal for targeted, reversible deletions. However, for millions of rows, it can be slower and consume more transaction log space than TRUNCATE.

-- Scenario 2: Clear all data, keep structure, fastest way.
TRUNCATE TABLE AuditLogs;

Explanation for Scenario 2 (TRUNCATE):

  • TRUNCATE TABLE is a DDL command. It quickly removes all rows from a table by deallocating the data pages, making it extremely fast for large tables.
  • It keeps the table's structure, indexes, and constraints intact.
  • In most RDBMS, TRUNCATE causes an implicit commit, meaning it cannot be rolled back. It also typically resets any auto-incrementing identity columns. This is suitable when you need to completely empty a table for a fresh start and don't need to undo the operation.

-- Scenario 3: Decommission the entire table.
DROP TABLE AuditLogs;

Explanation for Scenario 3 (DROP):

  • DROP TABLE is a DDL command. It removes the entire table definition from the database, including all its data, indexes, constraints, and triggers.
  • Like TRUNCATE, it implicitly commits and cannot be rolled back.
  • Use this when the table and its data are permanently no longer needed.

5. Data Manipulation Language (DML) & Querying

Data Manipulation Language (DML) commands are the core of interacting with your data. These commands allow you to add, modify, delete, and retrieve data stored within your database tables.

The Distinction: DML operates on the data within the defined structures (tables), while DDL (covered previously) operates on the structures themselves. DML operations are typically transactional and can be rolled back.

Modifying Data

INSERT (Adding new rows)

The INSERT statement adds one or more new rows of data into a table. You can insert a single row or multiple rows in a single command.

-- Single Row Insert (specifying columns)
INSERT INTO Employees (employee_id, first_name, last_name, email, hire_date, salary, department_id)
VALUES (101, 'Alice', 'Smith', 'alice@example.com', '2023-01-15', 75000.00, 1);

-- Single Row Insert (all columns, order must match table definition)
INSERT INTO Employees
VALUES (102, 'Bob', 'Johnson', 'bob@example.com', '2023-02-20', 82000.00, 2, NULL); -- Assuming a 'phone_number' column, which is NULL

-- Bulk Insert (multiple rows)
INSERT INTO Employees (employee_id, first_name, last_name, email, hire_date, salary, department_id)
VALUES
    (103, 'Charlie', 'Brown', 'charlie@example.com', '2023-03-01', 68000.00, 1),
    (104, 'Diana', 'Prince', 'diana@example.com', '2023-03-10', 95000.00, 3);

-- Inserting from another SELECT query (e.g., from a staging table)
INSERT INTO ProductionTable (col1, col2)
SELECT staging_col1, staging_col2
FROM StagingTable
WHERE status = 'processed';

UPDATE (Modifying existing rows)

The UPDATE statement modifies existing data in a table. It is absolutely critical to use a WHERE clause with UPDATE to specify which rows should be affected.

The importance of the WHERE clause:

If you omit the WHERE clause in an UPDATE statement, all rows in the table will be updated. This is a common and dangerous mistake that can lead to catastrophic data loss or corruption.

-- Update a single employee's salary
UPDATE Employees
SET salary = 80000.00
WHERE employee_id = 101;

-- Update multiple employees in a specific department
UPDATE Employees
SET salary = salary * 1.05 -- Give a 5% raise
WHERE department_id = 1 AND hire_date < '2023-01-01';

-- Update a column to NULL
UPDATE Employees
SET email = NULL
WHERE employee_id = 103;

DELETE (Removing existing rows)

The DELETE statement removes one or more rows from a table. Like UPDATE, the WHERE clause is paramount.

The importance of the WHERE clause:

If you omit the WHERE clause in a DELETE statement, all rows in the table will be deleted. This is also a common mistake with severe consequences.

-- Delete a single employee record
DELETE FROM Employees
WHERE employee_id = 102;

-- Delete all employees from a specific department
DELETE FROM Employees
WHERE department_id = 3;

-- Delete all employees hired before a certain date
DELETE FROM Employees
WHERE hire_date < '2022-01-01';

The SELECT Statement Deep Dive

The SELECT statement is the most frequently used DML command, used for retrieving data from a database. Mastering its various clauses is essential.

Selection Logic (CASE Statements)

CASE statements allow you to apply conditional logic directly within your query, creating new derived columns based on conditions. This is incredibly powerful for data transformation and reporting.

SELECT
    employee_id,
    first_name,
    salary,
    CASE
        WHEN salary < 60000 THEN 'Junior'
        WHEN salary BETWEEN 60000 AND 90000 THEN 'Mid-Level'
        WHEN salary > 90000 THEN 'Senior'
        ELSE 'Unknown' -- Optional, but good practice for completeness
    END AS career_level,
    CASE department_id
        WHEN 1 THEN 'Marketing'
        WHEN 2 THEN 'Sales'
        WHEN 3 THEN 'Engineering'
        ELSE 'Other'
    END AS department_name
FROM
    Employees;

Key points:

  • CASE statements evaluate conditions sequentially. The first WHEN condition that evaluates to TRUE will have its corresponding THEN value returned.
  • If no WHEN conditions are met, the value in the ELSE clause is returned. If ELSE is omitted and no conditions are met, NULL is returned.
  • There are two main forms:
    • Searched CASE: CASE WHEN condition1 THEN result1 WHEN condition2 THEN result2 ... END (More flexible, conditions can be anything).
    • Simple CASE: CASE column_name WHEN value1 THEN result1 WHEN value2 THEN result2 ... END (Compares a single expression to various values).

Filtering (WHERE)

The WHERE clause filters rows based on specified conditions before any grouping or aggregation (remember the order of execution!).

  • Operator Precedence (`AND` vs. `OR`):
    • AND has higher precedence than OR. This means conditions joined by AND are evaluated first.
    • Use parentheses to explicitly control the order of evaluation and avoid unexpected results.
-- Example: (salary > 70000 AND department_id = 1) OR hire_date > '2023-06-01'
-- Employees earning >70k in Dept 1, OR anyone hired after June 1, 2023.
SELECT *
FROM Employees
WHERE (salary > 70000 AND department_id = 1) OR hire_date > '2023-06-01';
  • Pattern Matching (`LIKE` with `%` and `_`):
    • %: Matches any sequence of zero or more characters.
    • _: Matches any single character.
-- Find employees whose first name starts with 'A'
SELECT * FROM Employees WHERE first_name LIKE 'A%';

-- Find employees whose last name has 'son' anywhere in it
SELECT * FROM Employees WHERE last_name LIKE '%son%';

-- Find employees whose email is exactly 5 characters long before '@example.com'
SELECT * FROM Employees WHERE email LIKE '_____@example.com';
  • List Filtering (`IN` vs. `NOT IN`):
    • IN: Matches any value in a specified list.
    • NOT IN: Matches any value *not* in a specified list.
-- Find employees in departments 1 or 3
SELECT * FROM Employees WHERE department_id IN (1, 3);

-- Find employees NOT in departments 1, 2, or 3
SELECT * FROM Employees WHERE department_id NOT IN (1, 2, 3);
`NOT IN` and NULLs: If the list used with NOT IN contains a NULL value (e.g., NOT IN (1, 2, NULL)), the condition can never evaluate to TRUE for any row (at best it is UNKNOWN), so the query effectively returns zero rows. Be very careful with NULLs in NOT IN lists; a sketch of safer alternatives follows at the end of this list.
  • Range Filtering (`BETWEEN` inclusive boundaries):
    • BETWEEN value1 AND value2: Matches values within the specified range, *inclusive* of both value1 and value2.
-- Find employees with salary between 60,000 and 80,000 (inclusive)
SELECT * FROM Employees WHERE salary BETWEEN 60000.00 AND 80000.00;

-- Find employees hired in Q1 2023
SELECT * FROM Employees WHERE hire_date BETWEEN '2023-01-01' AND '2023-03-31';
  • Edge Case: Filtering NULLs (`IS NULL` vs. `= NULL`):
    • As discussed in Section 2, direct comparison operators like =, !=, <, > with NULL always result in UNKNOWN.
    • To find rows where a column is NULL, use IS NULL.
    • To find rows where a column is not NULL, use IS NOT NULL.
-- Correctly find employees with no assigned phone number
SELECT * FROM Employees WHERE phone_number IS NULL;

-- Correctly find employees who DO have an assigned phone number
SELECT * FROM Employees WHERE phone_number IS NOT NULL;

-- This will NOT work as expected (returns no rows, because NULL = NULL is UNKNOWN)
-- SELECT * FROM Employees WHERE phone_number = NULL;
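Picking up the NOT IN warning from earlier in this list, here is a hedged sketch of how the trap typically appears with a subquery and two common ways around it (the Contractors table is hypothetical):

-- Trap: if ANY department_id in Contractors is NULL, this returns zero rows.
SELECT * FROM Employees
WHERE department_id NOT IN (SELECT department_id FROM Contractors);

-- Workaround 1: exclude NULLs inside the subquery.
SELECT * FROM Employees
WHERE department_id NOT IN (
    SELECT department_id FROM Contractors WHERE department_id IS NOT NULL
);

-- Workaround 2: rewrite with NOT EXISTS, which is not tripped up by NULLs.
SELECT E.*
FROM Employees E
WHERE NOT EXISTS (
    SELECT 1 FROM Contractors C WHERE C.department_id = E.department_id
);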

Sorting & Limits

Once rows are filtered, you often want to present them in a specific order and possibly retrieve only a subset.

  • `ORDER BY` (Multi-level sorting):
    • Sorts the result set by one or more columns in ascending (`ASC`, default) or descending (`DESC`) order.
    • You can sort by multiple columns, with later columns acting as tie-breakers.
-- Order by last name ascending, then first name ascending
SELECT *
FROM Employees
ORDER BY last_name ASC, first_name ASC;

-- Order by salary descending, then hire date descending (most recent first)
SELECT *
FROM Employees
ORDER BY salary DESC, hire_date DESC;
  • Handling NULLs in sort order (`NULLS FIRST`/`LAST`):
    • Many RDBMS (like PostgreSQL, Oracle) allow you to specify how NULL values should be treated in sorting (e.g., appear first or last). SQL Server and MySQL handle NULLs differently by default (often treating them as the lowest values).
-- PostgreSQL example: Employees with NULL phone numbers appear first when sorting by phone number
SELECT employee_id, first_name, phone_number
FROM Employees
ORDER BY phone_number ASC NULLS FIRST;

-- PostgreSQL example: Employees with NULL phone numbers appear last
SELECT employee_id, first_name, phone_number
FROM Employees
ORDER BY phone_number ASC NULLS LAST;
  • `LIMIT`/`OFFSET` (Pagination basics):
    • LIMIT n: Restricts the number of rows returned by the query to `n`.
    • OFFSET m: Skips `m` rows before beginning to return the rows. Used for pagination.
    • Syntax varies:
      • LIMIT x OFFSET y (PostgreSQL, MySQL, SQLite)
      • TOP x (SQL Server)
      • OFFSET y ROWS FETCH FIRST x ROWS ONLY (SQL Standard, Oracle 12c+)
-- Get the 5 highest-paid employees
SELECT *
FROM Employees
ORDER BY salary DESC
LIMIT 5; -- MySQL/PostgreSQL

-- For SQL Server:
-- SELECT TOP 5 * FROM Employees ORDER BY salary DESC;

-- Get the next 5 highest-paid employees (e.g., for page 2 of results)
SELECT *
FROM Employees
ORDER BY salary DESC
LIMIT 5 OFFSET 5; -- MySQL/PostgreSQL (skips the first 5, gets the next 5)

-- For SQL Server (using OFFSET FETCH):
-- SELECT * FROM Employees ORDER BY salary DESC OFFSET 5 ROWS FETCH NEXT 5 ROWS ONLY;

Practice & Application

🎯 Practice: Managing Employee Data with DML

Perform the following operations on the Employees table:

  1. Insert a new employee: Employee ID 105, John Doe, john.doe@example.com, hired today, $70,000 salary, Department ID 1, no phone number.
  2. Update employee 101 (Alice Smith) to give her a 10% raise.
  3. Delete any employee whose email is NULL (assuming such records exist for some reason).
Click for Solution

This problem demonstrates the core DML commands: INSERT, UPDATE, and DELETE, emphasizing the importance of the WHERE clause.

-- Assume Employees table structure:
-- (employee_id INT, first_name VARCHAR, last_name VARCHAR, email VARCHAR,
--  hire_date DATE, salary DECIMAL, department_id INT, phone_number VARCHAR)

-- 1. Insert a new employee
INSERT INTO Employees (employee_id, first_name, last_name, email, hire_date, salary, department_id, phone_number)
VALUES (105, 'John', 'Doe', 'john.doe@example.com', CURRENT_DATE, 70000.00, 1, NULL);

-- 2. Update employee 101 (Alice Smith) with a 10% raise
UPDATE Employees
SET salary = salary * 1.10
WHERE employee_id = 101;

-- 3. Delete employees with a NULL email
DELETE FROM Employees
WHERE email IS NULL; -- Correctly using IS NULL for filtering NULLs

Explanation:

  • INSERT uses CURRENT_DATE (or equivalent like GETDATE()/CURDATE()) to record today's date. Specifying NULL for `phone_number` explicitly sets it to missing.
  • The UPDATE statement correctly uses `employee_id = 101` in its WHERE clause to target only Alice Smith. Omitting this would have given *everyone* a 10% raise!
  • The DELETE statement correctly uses email IS NULL. Remember, email = NULL would never return TRUE.

🎯 Practice: Categorizing Employees with CASE Statements

Retrieve a list of all employees, their current salary, and a new column named SalaryCategory. The SalaryCategory should be defined as follows:

  • 'Low' if salary is less than $60,000
  • 'Medium' if salary is between $60,000 and $90,000 (inclusive)
  • 'High' if salary is greater than $90,000
  • 'Undetermined' if salary is NULL or somehow doesn't fit (though not expected with current logic)

Click for Solution

This problem utilizes the CASE statement to introduce conditional logic into the SELECT clause, creating a derived attribute based on existing data.

SELECT
    employee_id,
    first_name,
    last_name,
    salary,
    CASE
        WHEN salary IS NULL THEN 'Undetermined' -- Handle NULL first if desired
        WHEN salary < 60000 THEN 'Low'
        WHEN salary BETWEEN 60000 AND 90000 THEN 'Medium'
        WHEN salary > 90000 THEN 'High'
        ELSE 'Undetermined' -- Catch-all, though previous WHENs should cover most
    END AS SalaryCategory
FROM
    Employees
ORDER BY
    salary DESC;

Explanation:

  • The CASE statement evaluates conditions in order. It's often good practice to handle NULL cases first if they should take precedence.
  • BETWEEN 60000 AND 90000 is inclusive, covering salaries at both ends of the range.
  • The ELSE clause ensures that every row will have a SalaryCategory, preventing NULLs in the new column unless specifically desired.

🎯 Practice: Complex Filtering with WHERE and Pattern Matching

Find all employees who meet the following criteria:

  1. Their first name starts with 'A' OR their last name contains 'o'.
  2. AND they were hired in 2023.
  3. AND their department ID is either 1 or 3.
Click for Solution

This exercise combines multiple filtering conditions using AND, OR, LIKE, and IN, requiring careful consideration of operator precedence.

SELECT
    employee_id,
    first_name,
    last_name,
    hire_date,
    department_id
FROM
    Employees
WHERE
    (first_name LIKE 'A%' OR last_name LIKE '%o%') -- Parentheses are CRUCIAL here due to AND/OR precedence
    AND hire_date BETWEEN '2023-01-01' AND '2023-12-31'
    AND department_id IN (1, 3)
ORDER BY
    hire_date ASC;

Explanation:

  • (first_name LIKE 'A%' OR last_name LIKE '%o%'): Parentheses are essential. Without them, AND hire_date... would be evaluated with only `last_name LIKE '%o%'` first, potentially leading to incorrect results. This ensures the name condition is treated as a single unit.
  • hire_date BETWEEN '2023-01-01' AND '2023-12-31': Efficiently filters for the entire year 2023.
  • department_id IN (1, 3): Filters for specific department IDs.
  • All three main conditions are linked by AND, meaning all must be true for a row to be returned.

🎯 Practice: Paginated Employee List with NULLs Ordering

Retrieve a paginated list of employees, showing the second page of results, with 5 employees per page. Sort the employees primarily by department_id in ascending order. If two employees are in the same department, sort them by phone_number in descending order, ensuring that employees with no phone number (NULL) appear at the very end of their department group.

Click for Solution

This problem combines multi-level sorting, explicit NULL handling in sorting, and pagination using LIMIT and OFFSET.

SELECT
    employee_id,
    first_name,
    department_id,
    phone_number
FROM
    Employees
ORDER BY
    department_id ASC,
    phone_number DESC NULLS LAST -- Sort phone numbers DESC, with NULLs at the end
LIMIT 5 OFFSET 5; -- For Page 2, LIMIT 5 (rows per page), OFFSET 5 (skip first page)

Explanation:

  • ORDER BY department_id ASC: The primary sort key, arranging employees by department.
  • phone_number DESC NULLS LAST: The secondary sort key. It sorts phone numbers in descending order (e.g., '999-...' before '111-...'), but specifically places any NULL phone numbers at the very end within each department group. This relies on RDBMS support for NULLS LAST (like PostgreSQL/Oracle). For MySQL/SQL Server, you might need a CASE statement in ORDER BY for explicit NULL ordering; a sketch of that workaround follows this list.
  • LIMIT 5 OFFSET 5: This is for pagination. LIMIT 5 requests 5 rows, and OFFSET 5 skips the first 5 rows (effectively giving you rows 6-10, which is the second page).
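For engines without NULLS FIRST/NULLS LAST support, the workaround mentioned above is to sort on a CASE expression first; here is a hedged sketch of the same query in that style:

-- Emulating NULLS LAST with a CASE expression in ORDER BY (MySQL-style paging)
SELECT
    employee_id,
    first_name,
    department_id,
    phone_number
FROM
    Employees
ORDER BY
    department_id ASC,
    CASE WHEN phone_number IS NULL THEN 1 ELSE 0 END, -- NULLs sort after non-NULLs
    phone_number DESC
LIMIT 5 OFFSET 5; -- SQL Server instead: OFFSET 5 ROWS FETCH NEXT 5 ROWS ONLY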

6. Aggregation, Grouping & Granularity

While SELECT retrieves individual rows, often you need to summarize data. This is where aggregation and grouping come into play, allowing you to derive insights from collections of rows rather than just individual ones.

The Big Idea: Aggregation transforms multiple rows into a single summary row. Grouping specifies the criteria for which rows belong to the same "summary group."

Aggregate Functions

Aggregate functions (also known as "group functions") perform a calculation on a set of rows and return a single value.

  • COUNT(*) vs. COUNT(column_name) (Handling NULLs):
    • COUNT(*): Counts all rows in the result set, including those with NULL values in any column. It literally counts "how many rows are there?".
    • COUNT(column_name): Counts only the non-NULL values in the specified column_name. It counts "how many values are present in this column?".
    • COUNT(DISTINCT column_name): Counts the number of unique non-NULL values in the specified column.
-- Assume Employees table has some NULL emails and phone numbers.

-- Total number of employees (rows)
SELECT COUNT(*) AS total_employees
FROM Employees; -- e.g., 100

-- Number of employees with a recorded email address
SELECT COUNT(email) AS employees_with_email
FROM Employees; -- e.g., 95 (if 5 emails are NULL)

-- Number of unique department IDs
SELECT COUNT(DISTINCT department_id) AS unique_departments
FROM Employees;
  • SUM(), AVG():
    • SUM(column_name): Calculates the total sum of non-NULL values in a numeric column.
    • AVG(column_name): Calculates the average of non-NULL values in a numeric column.
-- Total salary paid across all employees
SELECT SUM(salary) AS total_payroll
FROM Employees;

-- Average salary of all employees
SELECT AVG(salary) AS average_salary
FROM Employees;
  • MIN(), MAX() (Works on Strings/Dates too):
    • MIN(column_name): Returns the smallest value in the column (numerically, alphabetically, or chronologically).
    • MAX(column_name): Returns the largest value in the column (numerically, alphabetically, or chronologically).
-- Highest and lowest salaries
SELECT MAX(salary) AS highest_salary, MIN(salary) AS lowest_salary
FROM Employees;

-- Earliest and latest hire dates
SELECT MIN(hire_date) AS earliest_hire, MAX(hire_date) AS latest_hire
FROM Employees;

-- Alphabetically first and last last names
SELECT MIN(last_name) AS first_last_name, MAX(last_name) AS last_last_name
FROM Employees;

Grouping Mechanics (`GROUP BY`)

The GROUP BY clause divides the rows returned by the FROM and WHERE clauses into groups, and then an aggregate function is applied to each group. This allows you to get summaries *per category*.

  • The rule of "Non-aggregated columns must be grouped":

    Any column that appears in the SELECT list that is *not* an aggregate function (like COUNT(), SUM(), etc.) must also appear in the GROUP BY clause. This ensures that the database knows how to categorize and summarize the data.

    -- Valid: Calculate average salary PER department
    SELECT
        department_id,          -- This is not aggregated
        AVG(salary) AS avg_salary_by_dept
    FROM
        Employees
    GROUP BY
        department_id;          -- So it MUST be in GROUP BY
    
    -- Invalid (will cause an error): Trying to select first_name without grouping it
    -- SELECT department_id, first_name, AVG(salary) FROM Employees GROUP BY department_id;
    -- Error: column "Employees.first_name" must appear in the GROUP BY clause or be used in an aggregate function
  • Grouping by expressions or derived values:

    You can group by any column, or even an expression or derived value (like a CASE statement or a part of a date).

    -- Group by hire year (PostgreSQL/MySQL example)
    SELECT
        EXTRACT(YEAR FROM hire_date) AS hire_year, -- Extract year
        COUNT(*) AS total_hires
    FROM
        Employees
    GROUP BY
        EXTRACT(YEAR FROM hire_date); -- Group by the extracted year
    
    -- Group by a CASE statement's output (e.g., career_level from previous section)
    SELECT
        CASE
            WHEN salary < 60000 THEN 'Low'
            WHEN salary BETWEEN 60000 AND 90000 THEN 'Medium'
            WHEN salary > 90000 THEN 'High'
            ELSE 'Undetermined'
        END AS SalaryCategory,
        COUNT(*) AS num_employees
    FROM
        Employees
    GROUP BY
        SalaryCategory; -- Grouping by the alias is often supported as well, or the full CASE expression

Filtering Groups (`HAVING`)

The HAVING clause is used to filter groups based on conditions applied to aggregate functions. It operates after GROUP BY.

  • WHERE vs. HAVING (Pre-aggregation vs. Post-aggregation):
    • WHERE: Filters individual rows before GROUP BY is applied. It cannot use aggregate functions.
    • HAVING: Filters entire groups after GROUP BY is applied and aggregate functions have been calculated. It *must* use aggregate functions (or columns from the GROUP BY clause).
-- Find departments where the average salary is above 70,000,
-- considering only employees hired after 2022.
SELECT
    department_id,
    AVG(salary) AS avg_dept_salary,
    COUNT(*) AS num_employees_in_group
FROM
    Employees
WHERE
    hire_date > '2022-12-31' -- Filters individual employees BEFORE grouping
GROUP BY
    department_id
HAVING
    AVG(salary) > 70000; -- Filters groups AFTER aggregation

Funnel Chart: Data Reduction from FROM → WHERE → GROUP BY → HAVING

This visual demonstrates how each clause reduces the data set, narrowing down the focus from all raw data to the specific insights you need.

FROM: All Raw Data (e.g., 1000 Employees)
WHERE: Filtered Rows (e.g., 750 Employees Hired After 2022)
GROUP BY: Aggregated Groups (e.g., 10 Department Groups)
HAVING: Filtered Groups (e.g., 3 Departments with Avg Salary > 70K)
This funnel illustrates the logical flow: the database first gathers all data (FROM), then discards individual rows not meeting criteria (WHERE), then groups the remaining rows (GROUP BY), and finally discards entire groups not meeting aggregate criteria (HAVING).

7. Joins & Relational Mechanics

Databases are powerful because they allow you to store related pieces of information in separate, organized tables. Joins are the cornerstone of the relational model, enabling you to combine data from two or more tables based on related columns.

The Core Idea: Joins bring together data that is logically connected, allowing you to answer complex business questions that span multiple entities (e.g., "Show me all orders made by customers in New York").

Join Theory

A great way to conceptualize different join types is through Venn Diagrams, which illustrate how rows from two sets (tables) are combined.

Venn Diagrams: Interpreting Join Types

INNER JOIN (Intersection)

Returns only the rows that have matching values in both tables.

A ∩ B

LEFT JOIN (All Left + Matching Right)


Returns all rows from the left table, and the matching rows from the right table. If there's no match on the right, NULLs appear for the right table's columns.

A + (A ∩ B)

RIGHT JOIN (All Right + Matching Left)


Returns all rows from the right table, and the matching rows from the left table. If there's no match on the left, NULLs appear for the left table's columns.

B + (A ∩ B)

FULL OUTER JOIN (Union of Both)

Returns all rows when there is a match in one of the tables. If no match, NULLs appear for the non-matching side. It's the union of LEFT JOIN and RIGHT JOIN.

A ∪ B

Join Types & Behaviors

Let's look at the practical syntax and typical use cases.

-- Example Tables:
-- Customers (customer_id, customer_name)
-- Orders (order_id, customer_id, order_date, total_amount)
  • INNER JOIN (Intersection):

    Combines rows from two tables where the join condition is true. Rows that do not have a match in both tables are excluded.

    SELECT C.customer_name, O.order_id, O.total_amount
    FROM Customers C
    INNER JOIN Orders O ON C.customer_id = O.customer_id;
    -- Shows only customers who have placed orders, and their orders.
  • LEFT JOIN (Preserving left table, NULLs on right):

    Returns all rows from the left table (Customers), and the matching rows from the right table (Orders). If there is no matching order for a customer, the order-related columns will be NULL.

    SELECT C.customer_name, O.order_id, O.total_amount
    FROM Customers C
    LEFT JOIN Orders O ON C.customer_id = O.customer_id;
    -- Shows ALL customers, including those with no orders (order_id, total_amount will be NULL for them).
  • RIGHT JOIN (Preserving right table):

    Returns all rows from the right table (Orders), and the matching rows from the left table (Customers). If an order has no matching customer (e.g., bad data), the customer-related columns will be NULL.

    SELECT C.customer_name, O.order_id, O.total_amount
    FROM Customers C
    RIGHT JOIN Orders O ON C.customer_id = O.customer_id;
    -- Shows ALL orders, even if they don't have a matching customer (customer_name will be NULL).
    -- Less common as LEFT JOIN can often achieve the same by swapping tables.
  • FULL OUTER JOIN (Union of both):

    Returns all rows when there is a match in either the left or right table. If a row in one table doesn't have a match in the other, the non-matching side will have NULLs.

    SELECT C.customer_name, O.order_id, O.total_amount
    FROM Customers C
    FULL OUTER JOIN Orders O ON C.customer_id = O.customer_id;
    -- Shows all customers (with or without orders) AND all orders (with or without customers).
    -- Where there's no match, the respective columns will be NULL.
  • CROSS JOIN (Cartesian Product - Warning signs):

    Combines every row from the first table with every row from the second table. If table A has N rows and table B has M rows, the result will have N × M rows. It does not require an `ON` clause.

    SELECT C.customer_name, O.order_id
    FROM Customers C
    CROSS JOIN Orders O;
    -- Produces every possible combination of a customer and an order.
    -- Use with extreme caution, often signals an error if not intended (e.g., generating all possible pairings).
  • SELF JOIN (Hierarchical data):

    Joining a table to itself. This is useful for comparing rows within the same table, often for hierarchical data (e.g., finding employees and their managers in an Employees table).

    -- Assume Employees table has: employee_id, employee_name, manager_id
    SELECT
        E.employee_name AS Employee,
        M.employee_name AS Manager
    FROM
        Employees E
    LEFT JOIN
        Employees M ON E.manager_id = M.employee_id;
    -- Links each employee to their manager (or NULL if no manager).

Join Syntax & Best Practices

  • ON (General conditions) vs. USING (Named columns):
    • ON: The most flexible way to specify join conditions. It allows any valid condition (equality, inequality, comparison operators, functions, etc.).
      FROM Customers C JOIN Orders O ON C.customer_id = O.customer_id AND O.order_date > '2023-01-01'
    • USING: A shorthand for INNER JOIN or LEFT JOIN when you are joining on columns that have the same name in both tables.
      FROM Customers JOIN Orders USING (customer_id)
      USING is syntactically cleaner when applicable, but less flexible than ON. The column specified in USING appears only once in the result set, unlike ON where both join columns from each table are retained.
  • Aliasing tables for readability:

    Assigning short aliases (e.g., C for Customers, O for Orders) to tables makes queries shorter and much more readable, especially with multiple joins or self-joins.

    SELECT C.customer_name, O.order_date -- Use aliases to specify which table a column comes from
    FROM Customers AS C -- 'AS' is optional but good practice
    INNER JOIN Orders AS O
    ON C.customer_id = O.customer_id;

Common Pitfalls

  • The "Fan-out" problem (Duplicate rows after join):

    This occurs when a row in one table matches multiple rows in another table, leading to an explosion of duplicate (or near-duplicate) rows. This is especially problematic before aggregation, as it can inflate counts or sums.

    Example: Customer with Multiple Phone Numbers

    Imagine a Customers table and a CustomerPhones table (one-to-many relationship). If a customer has two phone numbers and you join these tables, that customer's details will appear twice for each phone number. If you then try to COUNT customers, you'll get an inflated number.

    Customers (C)           | CustomerPhones (CP)                         | Result of SELECT * FROM C JOIN CP ON C.cust_id = CP.cust_id
    cust_id: 1, name: Alice | phone_id: 101, cust_id: 1, number: 555-1234 | cust_id: 1, name: Alice, phone_id: 101, number: 555-1234
                            | phone_id: 102, cust_id: 1, number: 555-5678 | cust_id: 1, name: Alice, phone_id: 102, number: 555-5678
    cust_id: 2, name: Bob   | phone_id: 201, cust_id: 2, number: 555-9876 | cust_id: 2, name: Bob, phone_id: 201, number: 555-9876

    If you then do COUNT(DISTINCT C.cust_id), it will still be correct (2 customers). But if you did COUNT(*) on the joined result, you'd get 3, which is wrong if you're trying to count *customers*.

    Solution: Be mindful of relationship cardinality. Use DISTINCT with aggregate functions where appropriate, or aggregate in subqueries/CTEs *before* joining (see the sketch after this list).

  • Joining on NULL values:

    Remember from Section 2 that NULL = NULL evaluates to UNKNOWN. This means that rows where the join column is NULL in both tables will not match using a standard equality join condition (ON T1.col = T2.col).

    -- If a customer's customer_id is NULL in Customers, and an order's customer_id is NULL in Orders,
    -- they will NOT join using: ON C.customer_id = O.customer_id
    -- To explicitly join on NULLs (rare but possible), you'd need:
    -- ON C.customer_id = O.customer_id OR (C.customer_id IS NULL AND O.customer_id IS NULL)
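    -- Some dialects also provide a NULL-safe equality operator for this purpose:
    -- IS NOT DISTINCT FROM (standard SQL / PostgreSQL) or <=> (MySQL).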

Practice & Application

🎯 Practice: Customers and Their Orders (INNER JOIN)

You have two tables: Customers (customer_id, customer_name) and Orders (order_id, customer_id, total_amount). Retrieve a list of all customers who have placed at least one order, along with the details of those orders (order ID and total amount). Only include customers who actually have orders.

Click for Solution

This is a classic use case for an INNER JOIN. It returns only the rows where there's a match in both the Customers and Orders tables based on their common customer_id.

-- Assume Customers and Orders tables exist with sample data:
-- Customers: { (1, 'Alice'), (2, 'Bob'), (3, 'Charlie') }
-- Orders:    { (101, 1, 50.00), (102, 2, 75.00), (103, 1, 120.00) }

SELECT
    C.customer_name,
    O.order_id,
    O.total_amount
FROM
    Customers C
INNER JOIN
    Orders O ON C.customer_id = O.customer_id;

Expected Output Logic:

customer_name  order_id  total_amount
Alice          101       50.00
Alice          103       120.00
Bob            102       75.00

Explanation:

  • INNER JOIN ensures that only customers with corresponding entries in the Orders table (and vice-versa) are included in the result.
  • Customer 'Charlie' (ID 3) does not appear because there are no orders with customer_id = 3.

🎯 Practice: All Customers and Their Orders (LEFT JOIN)

Using the same Customers and Orders tables, retrieve a list of all customers. For customers who have placed orders, include their order details. For customers who haven't placed any orders, still list them, but show NULLs for order-related fields.

Click for Solution

This scenario requires a LEFT JOIN. It prioritizes the left table (Customers) by returning all its rows, and only matching rows from the right table (Orders). Non-matching right-side columns will show NULL.

-- Assume Customers: { (1, 'Alice'), (2, 'Bob'), (3, 'Charlie') }
-- Orders:    { (101, 1, 50.00), (102, 2, 75.00), (103, 1, 120.00) }

SELECT
    C.customer_name,
    O.order_id,
    O.total_amount
FROM
    Customers C
LEFT JOIN
    Orders O ON C.customer_id = O.customer_id;

Expected Output Logic:

customer_name  order_id  total_amount
Alice          101       50.00
Alice          103       120.00
Bob            102       75.00
Charlie        NULL      NULL

Explanation:

  • LEFT JOIN ensures that all rows from the Customers table are present in the result.
  • Customer 'Charlie' appears, even without orders, because it exists in the left table. order_id and total_amount are NULL for Charlie.

🎯 Practice: All Orders and Their Customers (RIGHT JOIN)

Retrieve a list of all orders. For orders that have a matching customer, display the customer's name. If an order's customer_id does not exist in the Customers table (e.g., due to bad data or a deleted customer), still list the order, but show NULL for the customer's name.

Click for Solution

This is a perfect scenario for a RIGHT JOIN. It preserves all rows from the right table (Orders) and matches them with the left table (Customers) where possible.

-- Assume Customers: { (1, 'Alice'), (2, 'Bob'), (3, 'Charlie') }
-- Orders:    { (101, 1, 50.00), (102, 2, 75.00), (103, 1, 120.00), (104, 999, 200.00) } -- Order 104 has no matching customer

SELECT
    C.customer_name,
    O.order_id,
    O.total_amount
FROM
    Customers C
RIGHT JOIN
    Orders O ON C.customer_id = O.customer_id;

Expected Output Logic:

customer_name  order_id  total_amount
Alice          101       50.00
Bob            102       75.00
Alice          103       120.00
NULL           104       200.00

Explanation:

  • RIGHT JOIN ensures that all rows from the Orders table are present.
  • Order '104' appears, even though customer_id=999 doesn't exist in Customers. The customer_name is NULL for this row.
  • Note that you could achieve the same result with a LEFT JOIN by simply swapping the tables: FROM Orders O LEFT JOIN Customers C ON O.customer_id = C.customer_id;

🎯 Practice: Complete Customer-Order Relationship (FULL OUTER JOIN)

Retrieve a comprehensive list that includes all customers AND all orders. Customers without orders should appear with NULL order details. Orders without matching customers should appear with NULL customer details. Customers and orders with matches should show all their respective details.

Click for Solution

A FULL OUTER JOIN (often just FULL JOIN) achieves this by returning the union of both tables. It includes all rows from the left table and all rows from the right table, filling in NULLs where no match exists on either side.

-- Assume Customers: { (1, 'Alice'), (2, 'Bob'), (3, 'Charlie') }
-- Orders:    { (101, 1, 50.00), (102, 2, 75.00), (103, 1, 120.00), (104, 999, 200.00) }

SELECT
    C.customer_name,
    O.order_id,
    O.total_amount
FROM
    Customers C
FULL OUTER JOIN -- Or FULL JOIN in many systems
    Orders O ON C.customer_id = O.customer_id;

Expected Output Logic:

customer_name  order_id  total_amount
Alice          101       50.00
Alice          103       120.00
Bob            102       75.00
Charlie        NULL      NULL
NULL           104       200.00

Explanation:

  • This result set contains all matches, plus the non-matching rows from the Customers table (Charlie) and the non-matching rows from the Orders table (Order 104).

🎯 Practice: Finding Employees and Their Managers (SELF JOIN)

You have an Employees table with employee_id, employee_name, and manager_id (which refers to another employee_id in the same table). Retrieve a list showing each employee and the name of their manager. Employees without a manager should still be listed, showing NULL for the manager's name.

Click for Solution

A SELF JOIN is used here to join a table to itself. We treat the table as two logical instances: one for employees and one for managers. A LEFT JOIN ensures all employees are listed, even if they don't have a manager.

-- Assume Employees table:
-- { (1, 'Alice', NULL), (2, 'Bob', 1), (3, 'Charlie', 1), (4, 'David', 2) }
-- Alice (ID 1) has no manager. Bob and Charlie report to Alice. David reports to Bob.

SELECT
    E.employee_name AS EmployeeName,
    M.employee_name AS ManagerName
FROM
    Employees E -- Alias the table as 'E' for Employee
LEFT JOIN
    Employees M ON E.manager_id = M.employee_id; -- Alias as 'M' for Manager

Expected Output Logic:

EmployeeName  ManagerName
Alice         NULL
Bob           Alice
Charlie       Alice
David         Bob

Explanation:

  • By aliasing the Employees table as E and M, we can refer to it as if it were two separate tables.
  • The join condition E.manager_id = M.employee_id links an employee's manager_id to their manager's employee_id.
  • The LEFT JOIN ensures that 'Alice', who has no manager (manager_id is NULL), is still included in the result with ManagerName as NULL.

🎯 Practice: Mitigating the Fan-Out Problem with Aggregation

You have Customers (customer_id, customer_name), Orders (order_id, customer_id), and a new table Addresses (address_id, customer_id, street). A customer can have multiple addresses.

Your goal is to report each customer's total number of orders alongside each of their addresses, without letting the address join inflate the order counts.

Click for Solution

This problem demonstrates the "fan-out" issue and how to resolve it by aggregating data *before* joining to tables that might cause row duplication. If we joined Customers to Addresses first, a customer with 3 addresses would appear 3 times, incorrectly inflating order counts if we then joined to Orders.

-- Assume tables:
-- Customers: { (1, 'Alice'), (2, 'Bob') }
-- Orders:    { (101, 1), (102, 1), (103, 2) }
-- Addresses: { (A1, 1, '123 Main St'), (A2, 1, '456 Oak Ave'), (A3, 2, '789 Pine Rd') }

-- Incorrect approach (fan-out problem leading to inflated order counts):
-- Joining Customers to both Addresses and Orders multiplies the rows:
-- Alice has 2 addresses and 2 orders, so she produces 2 x 2 = 4 joined rows.
/*
SELECT
    C.customer_name,
    COUNT(O.order_id) AS total_orders -- Inflated: Alice shows 4 instead of 2
FROM
    Customers C
LEFT JOIN Addresses A ON C.customer_id = A.customer_id
LEFT JOIN Orders O ON C.customer_id = O.customer_id
GROUP BY
    C.customer_id, C.customer_name;
*/

-- Correct Solution: Aggregate orders first, then join.
WITH CustomerOrderSummary AS (
    SELECT
        customer_id,
        COUNT(order_id) AS num_orders
    FROM
        Orders
    GROUP BY
        customer_id
)
SELECT
    C.customer_name,
    A.street,
    COS.num_orders -- Correct count per customer
FROM
    Customers C
LEFT JOIN
    Addresses A ON C.customer_id = A.customer_id
LEFT JOIN
    CustomerOrderSummary COS ON C.customer_id = COS.customer_id
ORDER BY
    C.customer_name, A.street;

Explanation:

  • CustomerOrderSummary CTE: This subquery (or CTE) first calculates the `num_orders` for each `customer_id` directly from the Orders table. This aggregation happens *before* any potential fan-out.
  • Main Query: We then join Customers with Addresses (which can still cause duplication of customer details if they have multiple addresses, but it's *expected* if you want to see each address). Crucially, we then join this result with our pre-aggregated CustomerOrderSummary.
  • This way, even if a customer has multiple addresses, the num_orders displayed will be the correct total for that customer, not an inflated value due to the addresses join. If you then wanted to count distinct customers, you'd use COUNT(DISTINCT C.customer_id) on the final result, or aggregate that separately before joining.

8. Subqueries, CTEs & Query Architecture

As SQL queries become more complex, simply chaining `JOIN` and `WHERE` clauses can become unwieldy. Subqueries and Common Table Expressions (CTEs) offer powerful ways to break down complex problems into smaller, more manageable parts, improving both readability and maintainability.

The Goal: To manage complexity. Subqueries and CTEs allow you to perform intermediate steps and logically structure your query, much like functions or variables in programming.

Subqueries (Nested Queries)

A subquery (or inner query) is a query nested inside another SQL query (the outer query). They can be used in various parts of a SQL statement.

  • Scalar Subquery: Returns one value

    A subquery that returns a single row and a single column. It can be used anywhere a single value is expected (e.g., in a SELECT list, WHERE clause comparison).

    -- Find employees whose salary is greater than the average salary
    SELECT
        employee_name,
        salary
    FROM
        Employees
    WHERE
        salary > (SELECT AVG(salary) FROM Employees); -- Scalar subquery returning one value
  • Column Subquery: Returns a list (`IN` clause)

    A subquery that returns a single column but multiple rows. It's commonly used with operators like IN, NOT IN, ANY, ALL.

    -- Find employees who work in departments located in 'New York'
    SELECT
        E.employee_name,
        D.department_name
    FROM
        Employees E
    JOIN
        Departments D ON E.department_id = D.department_id
    WHERE
        D.department_id IN (SELECT department_id FROM Departments WHERE location = 'New York'); -- Column subquery returning a list
  • Row/Table Subquery: Returns a virtual table (`FROM` clause)

    A subquery that returns multiple columns and multiple rows, essentially acting as a temporary table. These are often called derived tables when used in the FROM clause.

    -- Find employees whose salary is above their department's average
    SELECT
        E.employee_name,
        E.salary,
        DA.avg_dept_salary
    FROM
        Employees E
    JOIN
        (SELECT department_id, AVG(salary) AS avg_dept_salary
         FROM Employees
         GROUP BY department_id) AS DA -- Table subquery (derived table)
    ON E.department_id = DA.department_id
    WHERE
        E.salary > DA.avg_dept_salary;
  • Correlated Subquery: Dependent on outer query (Performance hit)

    A subquery that depends on the outer query for its values. It executes once for each row processed by the outer query, which can lead to poor performance on large datasets.

    -- Find employees who earn more than the average salary in their *own* department
    SELECT
        E.employee_name,
        E.salary,
        E.department_id
    FROM
        Employees E
    WHERE
        E.salary > (SELECT AVG(salary) FROM Employees WHERE department_id = E.department_id);
        -- The inner subquery depends on E.department_id from the outer query.
    Correlated subqueries are often less performant than joins or CTEs because they are re-evaluated for every row of the outer query. While sometimes necessary, consider alternatives like joins with derived tables or window functions.
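
    A minimal sketch of the window-function rewrite (window functions are covered in Section 12), assuming the same Employees columns:

    SELECT employee_name, salary, department_id
    FROM (
        SELECT
            employee_name,
            salary,
            department_id,
            AVG(salary) OVER (PARTITION BY department_id) AS avg_dept_salary -- computed once per row, no re-execution per outer row
        FROM Employees
    ) WithDeptAvg
    WHERE salary > avg_dept_salary;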

Common Table Expressions (CTEs)

CTEs provide a way to define a temporary, named result set that you can reference within a single SQL statement (SELECT, INSERT, UPDATE, DELETE). They significantly improve query readability and can sometimes optimize execution.

  • Syntax: WITH cte_name AS ...
    WITH DepartmentAvgSalary AS (
        SELECT
            department_id,
            AVG(salary) AS avg_dept_salary
        FROM
            Employees
        GROUP BY
            department_id
    )
    SELECT
        E.employee_name,
        E.salary,
        DAS.avg_dept_salary
    FROM
        Employees E
    JOIN
        DepartmentAvgSalary DAS ON E.department_id = DAS.department_id
    WHERE
        E.salary > DAS.avg_dept_salary;
  • Readability benefits over subqueries:

    CTEs make complex queries much easier to read and understand by breaking them into logical, named steps. Instead of deeply nested subqueries, you define each step clearly at the beginning of the query.

  • Reusability within the same query:

    A single CTE can be referenced multiple times within the same subsequent query. You can also chain CTEs, where one CTE references a previously defined CTE.

    WITH HighEarners AS (
        SELECT employee_id, employee_name, salary, department_id
        FROM Employees
        WHERE salary > 90000
    ),
    EngineeringHighEarners AS (
        SELECT HE.employee_id, HE.employee_name, HE.salary, D.department_name
        FROM HighEarners HE
        JOIN Departments D ON HE.department_id = D.department_id -- Second CTE builds on the first
        WHERE D.department_name = 'Engineering'
    )
    SELECT * FROM EngineeringHighEarners;
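
    To show the "referenced multiple times" case, a minimal sketch reusing the HighEarners CTE above:

    WITH HighEarners AS (
        SELECT employee_id, employee_name, salary
        FROM Employees
        WHERE salary > 90000
    )
    SELECT HE.employee_name, HE.salary
    FROM HighEarners HE
    JOIN (SELECT MAX(salary) AS max_salary FROM HighEarners) M -- HighEarners referenced a second time
        ON HE.salary = M.max_salary;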

Best Practices

Decision Tree for Choosing CTEs vs. Subqueries vs. Temp Tables

When should you use which technique?

  • Need an intermediate result?
    • No → A standard SELECT/JOIN is enough.
    • Yes → Is the query short and used only once in a WHERE/SELECT?
      • Yes → Subquery (Scalar/Column).
      • No → Do you need readability, multiple references, or recursion?
        • Yes → CTE.
        • No → Is the result large, referenced across multiple separate queries, or does it need indexing?
          • Yes → Temporary Table.
          • No → CTE (preferred over complex derived tables).
Key Considerations:
  • Subqueries: Best for simple, one-off filtering conditions or scalar values. Can get difficult to read when deeply nested.
  • CTEs: Excellent for modularizing complex queries, improving readability, and enabling recursive queries (see the recursive sketch after this list). They are typically optimized similarly to derived tables.
  • Temporary Tables: Useful for very large intermediate results that need to be indexed, or for data that needs to be used across multiple, separate queries within the same session. They incur more overhead (creation, storage, cleanup) than CTEs.
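
A recursive CTE is worth a quick illustration. A minimal sketch, assuming the Employees(employee_id, employee_name, manager_id) table from the self-join examples; most systems require the RECURSIVE keyword, while SQL Server uses plain WITH:

WITH RECURSIVE OrgChart AS (
    -- Anchor member: employees with no manager (the top of the hierarchy)
    SELECT employee_id, employee_name, manager_id, 1 AS level
    FROM Employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive member: attach the direct reports of rows already in OrgChart
    SELECT E.employee_id, E.employee_name, E.manager_id, OC.level + 1
    FROM Employees E
    JOIN OrgChart OC ON E.manager_id = OC.employee_id
)
SELECT employee_name, level
FROM OrgChart
ORDER BY level, employee_name;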

Practice & Application

🎯 Practice: Scalar Subquery - Employees Above Average Salary

Retrieve the names and salaries of all employees who earn more than the overall average salary of *all* employees in the company.

Click for Solution

This problem uses a scalar subquery. The inner query calculates a single value (the overall average salary), which the outer query then uses for comparison in its WHERE clause.

SELECT
    E.first_name,
    E.last_name,
    E.salary
FROM
    Employees E
WHERE
    E.salary > (SELECT AVG(salary) FROM Employees); -- Scalar subquery returns single average salary

Explanation:

  • The subquery (SELECT AVG(salary) FROM Employees) executes first, computing the average salary across the entire Employees table.
  • This single average value is then used by the outer SELECT statement's WHERE clause to filter employees.

🎯 Practice: Column Subquery - Departments with Recent Hires

List the names of all departments that have hired at least one employee in the year 2023.

Click for Solution

This problem can be solved efficiently using a column subquery within an IN clause. The subquery will return a list of department IDs that satisfy the condition, which the outer query then uses to filter departments.

-- Assume Departments (department_id, department_name)
-- Assume Employees (employee_id, first_name, last_name, hire_date, department_id)

SELECT
    D.department_name
FROM
    Departments D
WHERE
    D.department_id IN (SELECT DISTINCT department_id
                        FROM Employees
                        WHERE hire_date BETWEEN '2023-01-01' AND '2023-12-31'); -- Column subquery returns list of IDs

Explanation:

  • The inner subquery identifies all unique department_ids from the Employees table where the hire_date falls within 2023.
  • The outer query then selects department_name from the Departments table, including only those departments whose department_id is found in the list generated by the subquery.

🎯 Practice: Employees Earning More Than Department Average (CTE vs. Derived Table)

Find the names and salaries of employees who earn more than the average salary of their own department. Show their salary and their department's average salary.

Click for Solution

This is a classic problem that nicely illustrates the advantage of CTEs for readability compared to a derived table (a subquery in the FROM clause). Both approaches are shown.

Solution 1: Using a Derived Table (Subquery in FROM)

SELECT
    E.first_name,
    E.last_name,
    E.salary,
    DA.avg_dept_salary
FROM
    Employees E
JOIN
    (SELECT department_id, AVG(salary) AS avg_dept_salary
     FROM Employees
     GROUP BY department_id) AS DA -- Derived Table (Table Subquery)
ON E.department_id = DA.department_id
WHERE
    E.salary > DA.avg_dept_salary;

Solution 2: Using a Common Table Expression (CTE)

WITH DepartmentAverage AS (
    SELECT
        department_id,
        AVG(salary) AS avg_dept_salary
    FROM
        Employees
    GROUP BY
        department_id
)
SELECT
    E.first_name,
    E.last_name,
    E.salary,
    DA.avg_dept_salary
FROM
    Employees E
JOIN
    DepartmentAverage DA ON E.department_id = DA.department_id
WHERE
    E.salary > DA.avg_dept_salary;

Explanation:

  • Both solutions first calculate the average salary per department.
  • The Derived Table approach nests this calculation directly into the FROM clause, requiring an alias (`DA`). It can sometimes become hard to read with many nested subqueries.
  • The CTE approach uses WITH DepartmentAverage AS (...) to define this intermediate result set with a clear name upfront. The main SELECT query then references this CTE as if it were a regular table. This greatly enhances readability and modularity, especially for more complex multi-step logic.

🎯 Practice: Chained CTEs - Top 2 Highest Paid Employees Per Department

Find the top 2 highest-paid employees in each department. Include their name, salary, department name, and their rank within their department.

Click for Solution

This problem is best solved using chained CTEs combined with window functions (which will be covered in more detail later, but this demonstrates a common pattern). We'll use ROW_NUMBER() to rank employees within each department.

-- Assume Employees (employee_id, first_name, last_name, salary, department_id)
-- Assume Departments (department_id, department_name)

WITH EmployeeRanked AS (
    SELECT
        E.employee_id,
        E.first_name,
        E.last_name,
        E.salary,
        E.department_id,
        ROW_NUMBER() OVER (PARTITION BY E.department_id ORDER BY E.salary DESC) AS rank_in_dept
    FROM
        Employees E
),
Top2PerDepartment AS (
    SELECT
        ER.employee_id,
        ER.first_name,
        ER.last_name,
        ER.salary,
        ER.department_id,
        ER.rank_in_dept
    FROM
        EmployeeRanked ER
    WHERE
        ER.rank_in_dept <= 2 -- Filter for the top 2 ranked employees
)
SELECT
    T2.first_name,
    T2.last_name,
    T2.salary,
    D.department_name,
    T2.rank_in_dept
FROM
    Top2PerDepartment T2
JOIN
    Departments D ON T2.department_id = D.department_id
ORDER BY
    D.department_name, T2.rank_in_dept;

Explanation:

  • EmployeeRanked CTE: This first CTE calculates a rank for each employee within their respective department. PARTITION BY E.department_id divides the data into separate groups for each department, and ORDER BY E.salary DESC ranks employees within each group by salary in descending order.
  • Top2PerDepartment CTE: This second CTE builds upon the first. It selects all columns from EmployeeRanked and applies a filter to keep only those employees whose rank_in_dept is 1 or 2.
  • Final SELECT: The main query then joins Top2PerDepartment with the Departments table to fetch the actual department names and presents the final, desired output, ordered for readability. This multi-step approach is far clearer than nesting many subqueries.

9. Database Constraints & Indexing Strategy

Ensuring data quality and optimizing database performance are two critical aspects of working with SQL. Constraints guarantee the integrity of your data, while indexes dramatically speed up data retrieval.

The Foundation: Constraints enforce rules at the database level, preventing invalid data from ever being stored. Indexes are like speed lanes, allowing the database to find data much faster.

Data Integrity Constraints

Constraints are rules enforced on data columns in a table. They are used to limit the type of data that can go into a table, ensuring the accuracy and reliability of the data.

`NOT NULL` (Mandatory fields)

Ensures that a column cannot have a NULL value. The field must contain a value.

CREATE TABLE Users (
    user_id INT PRIMARY KEY,
    username VARCHAR(50) NOT NULL -- username cannot be NULL
);

`UNIQUE` (Preventing duplicates)

Ensures that all values in a column (or a set of columns) are unique. NULLs are generally allowed; most systems (PostgreSQL, MySQL, Oracle) permit multiple NULLs in a UNIQUE column, while SQL Server allows only one.

CREATE TABLE Products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100) UNIQUE -- product_name must be unique
);

`PRIMARY KEY` (Unique + Not Null)

A column or set of columns that uniquely identifies each row in a table. It implicitly enforces both UNIQUE and NOT NULL constraints.

CREATE TABLE Orders (
    order_id INT PRIMARY KEY, -- order_id is unique and cannot be NULL
    order_date DATE
);

`FOREIGN KEY` (Relationship enforcement)

A column or set of columns in one table that refers to the PRIMARY KEY (or a UNIQUE key) in another table. It establishes a link between tables and enforces referential integrity.

CREATE TABLE OrderItems (
    item_id INT PRIMARY KEY,
    order_id INT,
    product_id INT,
    FOREIGN KEY (order_id) REFERENCES Orders(order_id)
    -- This ensures order_id must exist in the Orders table
);

`CHECK` (Custom validation logic)

Ensures that all values in a column satisfy a specific condition. You define the logical expression.

CREATE TABLE Employees (
    employee_id INT PRIMARY KEY,
    salary DECIMAL(10, 2) CHECK (salary > 0) -- Salary must be positive
);

`DEFAULT` (Fallback values)

Provides a default value for a column when no value is explicitly specified during an INSERT operation.

CREATE TABLE Tasks (
    task_id INT PRIMARY KEY,
    task_description TEXT,
    status VARCHAR(20) DEFAULT 'Pending' -- Default status is 'Pending'
);

Indexing Fundamentals (Under the Hood)

An index is a special lookup table that the database search engine can use to speed up data retrieval. Without indexes, the database engine would have to scan every row in a table to find matching data, which is slow for large tables.

What is an Index? (The Book Index Analogy)

Think of an index in a database like the index at the back of a textbook:

  • Without an index: If you want to find all mentions of "SQL Joins" in a textbook without an index, you'd have to read every single page from cover to cover. This is a "full table scan" in database terms.
  • With an index: You go to the index, find "SQL Joins", see it's on pages 57, 123, and 201. You immediately jump to those pages. This is how a database uses an index to quickly locate relevant rows.
-- Creating a simple index on a column
CREATE INDEX idx_employee_lastname
ON Employees (last_name);

-- Creating a composite index on multiple columns
CREATE INDEX idx_order_customer_date
ON Orders (customer_id, order_date DESC);

Clustered (Physical order) vs. Non-Clustered (Logical pointer)

These are the two main types of indexes, differing in how they store and manage data.

🌳 Clustered Index

  • Physical Storage: Determines the physical order of data rows in the table. The table data itself is stored in the order of the clustered index.
  • One Per Table: A table can have only one clustered index because the data rows can only be physically sorted in one order.
  • Primary Key: Often, the PRIMARY KEY automatically creates a clustered index.
  • Best for: Range scans, ORDER BY clauses, frequently accessed sequential data.

📑 Non-Clustered Index

  • Logical Pointer: Does not change the physical order of the data. It's a separate structure containing the indexed column(s) and pointers (row locators) to the actual data rows in the table.
  • Multiple Per Table: A table can have many non-clustered indexes.
  • Best for: Specific lookups (WHERE column = 'value'), joins (on the indexed columns).
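
For concreteness, a minimal SQL Server-flavoured sketch (other systems differ: MySQL's InnoDB always clusters the table on its primary key, and PostgreSQL has no persistent clustered index):

-- One clustered index per table: the rows are physically ordered by order_date.
CREATE CLUSTERED INDEX idx_orders_date ON Orders (order_date);

-- Any number of non-clustered indexes: separate structures pointing back to the rows.
CREATE NONCLUSTERED INDEX idx_orders_customer ON Orders (customer_id);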

Trade-offs: Read speed increase vs. Write/Update speed decrease

Indexes are not a free lunch. While they significantly boost read performance (SELECT statements), they come with costs:

  • Pros (Reads):
    • Faster data retrieval for queries using WHERE clauses, JOIN conditions, ORDER BY, GROUP BY on indexed columns.
    • Reduces I/O operations by allowing direct access to relevant data instead of full table scans.
  • Cons (Writes/Updates):
    • Increased Storage: Indexes consume disk space (as they are separate data structures).
    • Slower DML Operations: Every time you INSERT, UPDATE, or DELETE a row in the main table, the corresponding indexes also need to be updated. This adds overhead and slows down write operations.
    • Maintenance Overhead: The database engine needs to maintain indexes, which can impact performance, especially for tables with high write activity.

B-Tree Structure Diagram (Simplified)

Most SQL databases use B-tree (Balanced Tree) structures for their indexes. This structure allows for efficient searching, insertion, and deletion.

                              Root Node
                      [Key A] < [Key B] < [Key C]
                       /                       \
        Internal Node 1                         Internal Node 2
       [Key 1] < [Key 2]                       [Key 3] < [Key 4]
         /            \                           /            \
  Leaf Node 1     Leaf Node 2              Leaf Node 3     Leaf Node 4
(Data Pointers)  (Data Pointers)         (Data Pointers)  (Data Pointers)
How it works: You start at the Root Node, which guides you to the correct Internal Node, and then finally to a Leaf Node. The Leaf Nodes contain the actual key values and pointers (references) to the full data rows in the table. This tree structure minimizes the number of "jumps" (disk I/O operations) needed to find data.

Practice & Application

🎯 Practice: Designing a Table with Comprehensive Constraints

Design and create a SQL table named CourseEnrollments with the following requirements:

  • An auto-incrementing enrollment_id that serves as the primary key.
  • A student_id (integer) and a course_id (integer) that together form a unique composite key (a student can enroll in a course only once).
  • student_id must reference the student_id in an existing Students table.
  • course_id must reference the course_id in an existing Courses table.
  • An enrollment_date that defaults to the current date.
  • A grade column (VARCHAR(2)) that can be NULL initially but, if provided, must be one of 'A+', 'A', 'B+', 'B', 'C+', 'C', 'D', 'F'.
Click for Solution

This problem demonstrates the creation of a table incorporating multiple data integrity constraints: `PRIMARY KEY`, `FOREIGN KEY`, `UNIQUE` (composite), `DEFAULT`, and `CHECK`.

-- First, assume Students and Courses tables already exist with appropriate PKs:
-- CREATE TABLE Students (student_id INT PRIMARY KEY, student_name VARCHAR(100));
-- CREATE TABLE Courses (course_id INT PRIMARY KEY, course_name VARCHAR(100));

CREATE TABLE CourseEnrollments (
    enrollment_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- PostgreSQL syntax for auto-increment PK
    -- (For MySQL, use INT AUTO_INCREMENT PRIMARY KEY)
    -- (For SQL Server, use INT IDENTITY(1,1) PRIMARY KEY)
    student_id INT NOT NULL,
    course_id INT NOT NULL,
    enrollment_date DATE DEFAULT CURRENT_DATE,
    grade VARCHAR(2) CHECK (grade IN ('A+', 'A', 'B+', 'B', 'C+', 'C', 'D', 'F')),
    
    -- Composite UNIQUE key for student_id and course_id
    CONSTRAINT UQ_StudentCourse UNIQUE (student_id, course_id),
    
    -- Foreign Key referencing Students table
    CONSTRAINT FK_StudentEnrollment FOREIGN KEY (student_id) REFERENCES Students(student_id),
    
    -- Foreign Key referencing Courses table
    CONSTRAINT FK_CourseEnrollment FOREIGN KEY (course_id) REFERENCES Courses(course_id)
);

Explanation:

  • enrollment_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY defines an auto-incrementing primary key.
  • student_id INT NOT NULL and course_id INT NOT NULL ensure these critical identifiers are never missing.
  • UNIQUE (student_id, course_id) forms a composite unique key, preventing a student from being enrolled in the same course more than once.
  • DEFAULT CURRENT_DATE automatically populates the `enrollment_date` if not provided.
  • CHECK (grade IN (...)) restricts the allowed values for the `grade` column.
  • FOREIGN KEY constraints link to the `Students` and `Courses` tables, enforcing referential integrity.

🎯 Practice: Modifying Existing Constraints and Column Definitions

You have an existing Users table defined as:

CREATE TABLE Users (
    user_id INT PRIMARY KEY,
    username VARCHAR(50) UNIQUE,
    email VARCHAR(100),
    registration_date DATE
);
You now need to perform the following DDL operations:
  1. Make the email column `NOT NULL`, assuming all existing rows already have a non-NULL email.
  2. Add a new `last_login_timestamp` column (TIMESTAMP) which should automatically set its value to the current timestamp if not explicitly provided during an INSERT.

Click for Solution

This problem uses `ALTER TABLE` to modify column constraints and add a new column with a default value, demonstrating how to evolve a schema over time.

-- 1. Make the 'email' column NOT NULL
ALTER TABLE Users
ALTER COLUMN email SET NOT NULL; -- PostgreSQL syntax
-- For SQL Server: ALTER TABLE Users ALTER COLUMN email VARCHAR(100) NOT NULL;
-- For MySQL: ALTER TABLE Users MODIFY COLUMN email VARCHAR(100) NOT NULL;

-- 2. Add 'last_login_timestamp' with a default value
ALTER TABLE Users
ADD COLUMN last_login_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP;
-- CURRENT_TIMESTAMP is widely supported; MySQL also accepts NOW().
-- For SQL Server: ALTER TABLE Users ADD last_login_timestamp DATETIME2 DEFAULT GETDATE(); (no COLUMN keyword)

Explanation:

  • The `ALTER COLUMN email SET NOT NULL` command (or its RDBMS-specific equivalent) enforces that future inserts or updates to the `email` column cannot be `NULL`. This operation would fail if there were any existing `NULL` values in the `email` column.
  • The `ADD COLUMN last_login_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP` command adds a new column and ensures that if an `INSERT` statement doesn't provide a value for this column, the database will automatically stamp it with the time of insertion.

🎯 Practice: Strategic Indexing for Query Performance

Consider a large Transactions table with millions of rows, containing columns like transaction_id (PK), customer_id, transaction_date, amount, and status. A frequently run query looks like this:

SELECT transaction_id, transaction_date, amount
FROM Transactions
WHERE customer_id = 12345
  AND transaction_date BETWEEN '2023-01-01' AND '2023-01-31'
  AND status = 'completed'
ORDER BY transaction_date DESC;

Suggest a single composite index that would most effectively speed up this specific query, explaining why.

Click for Solution

This problem tests your ability to design an effective index by considering the `WHERE` clause conditions and the `ORDER BY` clause.

CREATE INDEX idx_transactions_customer_date_status_amount
ON Transactions (customer_id, transaction_date DESC, status, amount);

Explanation:

  • customer_id first: It's used with an equality operator (`=`) and is typically highly selective. The index first narrows the search to a single customer's transactions.
  • transaction_date DESC second: Within that customer, the query filters on a date range and sorts by `transaction_date DESC`. Placing the date second (and matching its sort direction) lets the database seek to the relevant range and satisfy the `ORDER BY` directly from the index, avoiding an extra sort step.
  • status third: Because `status` sits after the range column it cannot be used as a seek predicate, but it still acts as a filter applied while scanning the index range. (Many optimizers equally favour putting both equality columns first, i.e. (customer_id, status, transaction_date DESC, amount); that ordering also satisfies the `ORDER BY`. See the alternative sketch below.)
  • amount last: `amount` appears only in the `SELECT` list. Including it makes this a "covering index": the query can be answered entirely from the index without going back to the base table, which can speed it up further.

Why this order? Indexes are most effective when their leading columns match the most restrictive conditions in the WHERE clause. The `ORDER BY` clause also benefits greatly from a matching index order.
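
A minimal sketch of the equality-columns-first alternative mentioned above (the index name is illustrative):

-- Both equality filters lead, the range/sort column follows, amount keeps the index covering.
CREATE INDEX idx_transactions_customer_status_date
ON Transactions (customer_id, status, transaction_date DESC, amount);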

🎯 Practice: Discussing Indexing Trade-offs - When NOT to Index

You are managing a `SessionLogs` table. This table receives millions of new records every hour (very high write volume). It has columns such as `session_id` (PK), `user_id`, `timestamp`, `event_type`, and `session_status` (which can only be 'Active', 'Inactive', 'Error').

Your team proposes adding an index on the `session_status` column to speed up queries like `SELECT COUNT(*) FROM SessionLogs WHERE session_status = 'Active';`

Should you implement this index? Discuss the trade-offs and explain your reasoning.

Click for Solution

This problem requires evaluating the costs and benefits of an index in a high-write, low-cardinality scenario, demonstrating a practical understanding of indexing trade-offs.

Reasoning:

  • High Write Volume Impact: The table receives millions of records per hour. Every `INSERT` operation would require updating the `session_status` index. This overhead would significantly slow down write performance, which is a critical concern for this table.
  • Low Cardinality Column: The `session_status` column has very low cardinality (only three distinct values: 'Active', 'Inactive', 'Error'). An index is most effective when it quickly narrows down the search space. For a low-cardinality column, an index on `session_status` would point to a large percentage of the table's rows (e.g., if 50% are 'Active', the index will still lead to scanning half the table). In such cases, the database's query optimizer might determine that a full table scan is *more efficient* than traversing the index and then fetching half the data pages, especially if the data is already physically grouped.
  • Potential Read Benefit (but limited): While queries filtering by `session_status` might theoretically benefit, the actual gain might be minimal due to low cardinality. For `COUNT(*)`, the index might only help slightly if it's a covering index for the count, but it won't drastically speed up filtering for a large proportion of the table.

Conclusion:

Given the extremely high write volume and the low cardinality of the `session_status` column, adding an index on `session_status` is likely **not a good idea**. The performance penalty on `INSERT` operations would outweigh the minimal (if any) read performance gains.

Alternative considerations: For such scenarios, partitioning the table by `timestamp` might be more effective for managing large data volumes and specific date-range queries. If reads on `session_status` are absolutely critical and slow, a materialized view that pre-aggregates counts for each status might be a better approach than a direct index on a low-cardinality column in a high-write table.
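
A minimal PostgreSQL-flavoured sketch of that materialized-view alternative; the view name is illustrative, and it must be refreshed to stay current:

CREATE MATERIALIZED VIEW SessionStatusCounts AS
SELECT
    session_status,
    COUNT(*) AS session_count
FROM
    SessionLogs
GROUP BY
    session_status;

-- Cheap to read:
-- SELECT session_count FROM SessionStatusCounts WHERE session_status = 'Active';

-- Refresh periodically (e.g., from a scheduled job):
-- REFRESH MATERIALIZED VIEW SessionStatusCounts;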

10. Set Operations

Set operations in SQL allow you to combine the results of two or more SELECT statements into a single result set. Unlike joins, which combine data horizontally based on related columns, set operations combine data vertically, stacking one result set on top of another.

The Core Idea: Think of set operations as manipulating entire sets of rows, similar to mathematical set theory concepts like union, intersection, and difference.

Vertical Combination of Data

Let's explore the primary set operators:

-- Assume two tables with similar structures:
-- Employees (employee_id, name, city)
-- Contractors (contractor_id, name, city)
-- Both have a 'name' and 'city' column for demonstration.
  • UNION (Removes duplicates, slower):

    Combines the result sets of two or more SELECT statements and returns all distinct rows. Any duplicate rows across the combined result sets are removed.

    SELECT name, city FROM Employees
    UNION
    SELECT name, city FROM Contractors;
    -- Result: All unique names and cities from both tables. If 'John, New York' appears in both, it will only be listed once.
  • UNION ALL (Keeps duplicates, faster):

    Combines the result sets of two or more SELECT statements, but unlike UNION, it retains all duplicate rows. This makes it generally faster as the database doesn't need to perform a distinct sort operation.

    SELECT name, city FROM Employees
    UNION ALL
    SELECT name, city FROM Contractors;
    -- Result: All names and cities from both tables, including duplicates. If 'John, New York' appears in both, it will be listed twice.
  • INTERSECT (Rows in both sets):

    Returns only the rows that are present in both result sets. Only rows that are identical in all selected columns will be returned.

    SELECT name, city FROM Employees
    INTERSECT
    SELECT name, city FROM Contractors;
    -- Result: Only names and cities that exist in BOTH Employees AND Contractors.
  • EXCEPT / MINUS (Rows in A but not B):

    Returns rows from the first SELECT statement that are not found in the second SELECT statement. (Oracle uses MINUS; SQL Server and PostgreSQL use EXCEPT; MySQL added EXCEPT in version 8.0.31.)

    SELECT name, city FROM Employees
    EXCEPT -- Or MINUS in Oracle
    SELECT name, city FROM Contractors;
    -- Result: Names and cities that are in Employees but NOT in Contractors.

Requirements

For set operations to work correctly, the participating SELECT statements must adhere to two crucial rules:

  • Column Count Match:

    The number of columns in all SELECT statements being combined must be identical.

    -- This would fail because the second SELECT has fewer columns
    -- SELECT name, city FROM Employees
    -- UNION
    -- SELECT name FROM Contractors; -- ERROR: column mismatch
  • Data Type Compatibility:

    The data types of corresponding columns in each SELECT statement must be compatible (e.g., you can't `UNION` an `INT` with a `VARCHAR` in the same column position). Implicit conversion often occurs if types are related (e.g., `INT` to `DECIMAL`), but it's best to be explicit or ensure types are similar.

    -- This might lead to unexpected results or errors if type mismatch is severe
    -- SELECT employee_id, name FROM Employees
    -- UNION
    -- SELECT date_of_birth, full_name FROM Contacts; -- ERROR or weird conversion

Edge Case: Sorting results in Set Operations

When you use UNION, UNION ALL, INTERSECT, or EXCEPT, the ORDER BY clause can only be specified once, at the very end of the entire combined query. It applies to the final, consolidated result set.

-- Incorrect: ORDER BY inside each SELECT
-- (SELECT name FROM Employees ORDER BY name) UNION (SELECT name FROM Contractors ORDER BY name); -- Syntax Error

-- Correct: ORDER BY at the end, referring to columns by their final selected name/position
SELECT name, city FROM Employees
UNION ALL
SELECT name, city FROM Contractors
ORDER BY name ASC, city DESC; -- Applies to the combined result set
If you need to sort individual sub-queries before a set operation (e.g., for pagination within each subset), you would typically use subqueries or CTEs for that, but the final `ORDER BY` for the combined result always comes last.
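
A minimal sketch of that CTE approach, reusing the Employees/Contractors columns from above (LIMIT is PostgreSQL/MySQL syntax; SQL Server would use TOP):

-- Sort and trim each subset first, then combine the trimmed results.
WITH TopEmployees AS (
    SELECT name, city FROM Employees ORDER BY name LIMIT 5
),
TopContractors AS (
    SELECT name, city FROM Contractors ORDER BY name LIMIT 5
)
SELECT name, city FROM TopEmployees
UNION ALL
SELECT name, city FROM TopContractors
ORDER BY name; -- The final ORDER BY still applies to the combined result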

Practice & Application

🎯 Practice: Combined Roster (Including Duplicates)

You have two tables: Employees (employee_id, name, department, email) and Contractors (contractor_id, name, department, email). You need to create a single, consolidated list of all individuals (both employees and contractors), showing their name, department, and email. It's important to include all entries, even if an individual (by name and email) appears in both tables.

Click for Solution

This problem requires `UNION ALL`. It combines the result sets of two (or more) queries and retains all duplicate rows. This makes it efficient as the database does not need to perform a costly distinct check.

-- Sample Data (conceptual):
-- Employees: { ('E1', 'Alice', 'HR', 'alice@example.com'), ('E2', 'Bob', 'IT', 'bob@example.com'), ('E3', 'Charlie', 'Sales', 'charlie@example.com') }
-- Contractors: { ('C1', 'Bob', 'IT', 'bob@example.com'), ('C2', 'David', 'Marketing', 'david@example.com') }

SELECT name, department, email FROM Employees
UNION ALL
SELECT name, department, email FROM Contractors
ORDER BY name;

Explanation:

  • The first SELECT retrieves `name`, `department`, `email` from Employees.
  • The second SELECT retrieves the same columns from Contractors.
  • UNION ALL stacks these two result sets on top of each other. If 'Bob' with email 'bob@example.com' exists in both, he will appear twice.
  • The final ORDER BY name sorts the entire combined result set.

Expected Output (Logical, order by name):

name     department  email
Alice    HR          alice@example.com
Bob      IT          bob@example.com
Bob      IT          bob@example.com
Charlie  Sales       charlie@example.com
David    Marketing   david@example.com

🎯 Practice: Unique Contact List (Removing Duplicates)

Using the same Employees and Contractors tables, you now need a unique list of all contact emails (excluding names or departments). If an email address is associated with both an employee and a contractor, it should only appear once.

Click for Solution

This problem calls for `UNION`. It combines result sets and automatically removes duplicate rows, ensuring each distinct row appears only once in the final output.

-- Sample Data (conceptual):
-- Employees: { ('E1', 'Alice', 'HR', 'alice@example.com'), ('E2', 'Bob', 'IT', 'bob@example.com'), ('E3', 'Charlie', 'Sales', 'charlie@example.com') }
-- Contractors: { ('C1', 'Bob', 'IT', 'bob@example.com'), ('C2', 'David', 'Marketing', 'david@example.com') }

SELECT email FROM Employees
UNION
SELECT email FROM Contractors
ORDER BY email;

Explanation:

  • Both SELECT statements retrieve only the `email` column.
  • UNION combines these lists and then performs a distinct operation, removing any `email` that appears in both original lists.
  • The final ORDER BY email sorts the unique email addresses.

Expected Output (Logical, order by email):

email
alice@example.com
bob@example.com
charlie@example.com
david@example.com

🎯 Practice: Finding Dual-Role Individuals (Intersection)

Your team suspects that some individuals might be listed in both the Employees table and the Contractors table, potentially indicating a data anomaly or a special dual role. Identify the names and emails of all individuals who exist in *both* tables.

Click for Solution

To find common rows between two result sets, the `INTERSECT` operator is used. It returns only the rows that are present in both the first and second `SELECT` statements.

-- Sample Data (conceptual):
-- Employees: { ('E1', 'Alice', 'HR', 'alice@example.com'), ('E2', 'Bob', 'IT', 'bob@example.com'), ('E3', 'Charlie', 'Sales', 'charlie@example.com') }
-- Contractors: { ('C1', 'Bob', 'IT', 'bob@example.com'), ('C2', 'David', 'Marketing', 'david@example.com'), ('C3', 'Alice', 'Sales', 'alice@example.com') }
-- (Note: Alice in Contractors has a different department, but same name/email)

SELECT name, email FROM Employees
INTERSECT
SELECT name, email FROM Contractors
ORDER BY name;

Explanation:

  • INTERSECT compares the rows produced by the two SELECT statements.
  • Only rows where *all* selected columns (`name` and `email`) are identical in both sets are included in the final result.
  • In our sample data, 'Bob' and 'Alice' are present with the same name and email in both lists.

Expected Output (Logical, order by name):

name   email
Alice  alice@example.com
Bob    bob@example.com

🎯 Practice: Exclusive Employees (Difference)

You need to identify employees who are *only* in the Employees table and have no corresponding entry (by name and email) in the Contractors table. List their name and email.

Click for Solution

The `EXCEPT` (or `MINUS` in Oracle) operator is used here. It returns rows from the first query that are not present in the second query.

-- Sample Data (conceptual):
-- Employees: { ('E1', 'Alice', 'HR', 'alice@example.com'), ('E2', 'Bob', 'IT', 'bob@example.com'), ('E3', 'Charlie', 'Sales', 'charlie@example.com') }
-- Contractors: { ('C1', 'Bob', 'IT', 'bob@example.com'), ('C2', 'David', 'Marketing', 'david@example.com'), ('C3', 'Alice', 'Sales', 'alice@example.com') }

SELECT name, email FROM Employees
EXCEPT -- Use MINUS in Oracle
SELECT name, email FROM Contractors
ORDER BY name;

Explanation:

  • The first SELECT gets all names and emails from Employees.
  • The second SELECT gets all names and emails from Contractors.
  • EXCEPT then subtracts the second set from the first. Any `(name, email)` pair that exists in both will be removed from the first set's result.
  • Only 'Charlie' remains because 'Alice' and 'Bob' (with their respective emails) are also found in the Contractors table.

Expected Output (Logical, order by name):

name     email
Charlie  charlie@example.com

11. Views and Stored Procedures

Beyond raw tables and ad-hoc queries, databases offer mechanisms to encapsulate logic and improve performance, security, and reusability. Views and Stored Procedures are two such powerful features.

The Goal: To abstract complexity, enforce security, and optimize frequently executed operations.

Views (Virtual Tables)

A view is a virtual table based on the result-set of a SQL query. It contains rows and columns, just like a real table, but it does not store data itself. Instead, it retrieves data from the underlying tables whenever it is queried.

CREATE VIEW ActiveEmployees AS
SELECT
    employee_id,
    first_name,
    last_name,
    email,
    department_id
FROM
    Employees
WHERE
    status = 'Active';

-- You can now query the view just like a table:
SELECT * FROM ActiveEmployees WHERE department_id = 1;
  • Hiding complexity / Business logic encapsulation:

    Views allow you to pre-package complex queries (e.g., involving multiple joins, aggregations, or complex CASE statements) and present them as a simple table. This simplifies interactions for users or applications who don't need to understand the underlying complexity.

    -- Instead of users writing this complex query every time:
    SELECT
        C.customer_name,
        SUM(O.total_amount) AS total_spent,
        COUNT(O.order_id) AS num_orders
    FROM
        Customers C
    LEFT JOIN
        Orders O ON C.customer_id = O.customer_id
    GROUP BY
        C.customer_name
    HAVING
        COUNT(O.order_id) > 0;
    
    -- You can define a view:
    CREATE VIEW CustomerSpendingSummary AS
    SELECT
        C.customer_name,
        SUM(O.total_amount) AS total_spent,
        COUNT(O.order_id) AS num_orders
    FROM
        Customers C
    LEFT JOIN
        Orders O ON C.customer_id = O.customer_id
    GROUP BY
        C.customer_name
    HAVING
        COUNT(O.order_id) > 0;
    
    -- And then users just query the view:
    SELECT * FROM CustomerSpendingSummary WHERE total_spent > 1000;
  • Security (Restricting access to specific columns/rows):

    Views can act as a security mechanism. Instead of granting users direct access to sensitive base tables, you can grant them access only to a view that exposes a subset of columns or rows. For example, a view might exclude salary or personal identifying information.

    -- Granting a user access to the Employees table exposes ALL columns.
    -- Instead, create a view that hides sensitive data:
    CREATE VIEW PublicEmployeeInfo AS
    SELECT
        employee_id,
        first_name,
        last_name,
        department_id
    FROM
        Employees;
    
    -- Then grant access ONLY to the view:
    -- GRANT SELECT ON PublicEmployeeInfo TO 'reporting_user';
  • Materialized Views (Cached data for performance - Conceptual):

    While standard views are virtual and re-execute their query every time they are accessed, materialized views (supported in some RDBMS like PostgreSQL, Oracle, SQL Server) are different. They physically store the result of the query at the time of their creation or last refresh.

    • Pros: Much faster read performance, especially for complex aggregations, as the data is pre-calculated.
    • Cons: The data in a materialized view can become stale. It needs to be explicitly refreshed (manually or on a schedule) to reflect changes in the underlying base tables.
    -- PostgreSQL syntax example:
    CREATE MATERIALIZED VIEW DailySalesSummary AS
    SELECT
        order_date,
        SUM(total_amount) AS daily_revenue,
        COUNT(order_id) AS total_orders
    FROM
        Orders
    GROUP BY
        order_date;
    
    -- To update its data:
    -- REFRESH MATERIALIZED VIEW DailySalesSummary;

Stored Procedures

A stored procedure is a prepared SQL code that you can save, so the code can be reused over and over again. It's a collection of SQL statements and control-of-flow statements (like `IF`, `LOOP`).

-- PostgreSQL syntax example (syntax varies significantly by RDBMS)
CREATE PROCEDURE GetEmployeeCountByDepartment(
    IN dept_id INT,
    OUT employee_count INT
)
LANGUAGE plpgsql
AS $$
BEGIN
    SELECT COUNT(*) INTO employee_count
    FROM Employees
    WHERE department_id = dept_id;
END;
$$;

-- To execute (PostgreSQL):
-- CALL GetEmployeeCountByDepartment(1, NULL); -- Result will be in a message or output parameter
-- For SQL Server: EXEC GetEmployeeCountByDepartment @dept_id = 1, @employee_count = @count OUTPUT;
  • Pre-compiled SQL code:

    When a stored procedure is first created or executed, the database can parse it once and cache an execution plan. Subsequent executions reuse that cached plan, which can be faster than repeatedly sending and re-planning raw SQL statements (the exact caching behaviour varies by RDBMS).

  • Accepting parameters:

    Stored procedures can accept input parameters, allowing for dynamic and flexible execution. They can also return output parameters or result sets.

    -- Example with input and output parameters
    CREATE PROCEDURE UpdateEmployeeSalary (
        IN employee_id_param INT,
        IN percentage_raise DECIMAL(5, 2),
        OUT new_salary_out DECIMAL(10, 2)
    )
    LANGUAGE plpgsql
    AS $$
    BEGIN
        UPDATE Employees
        SET salary = salary * (1 + percentage_raise)
        WHERE employee_id = employee_id_param;
    
        SELECT salary INTO new_salary_out
        FROM Employees
        WHERE employee_id = employee_id_param;
    END;
    $$;
    
    -- CALL UpdateEmployeeSalary(101, 0.05, NULL);
  • Reducing network traffic:

    Instead of sending multiple complex SQL statements over the network, an application can simply send a single command to execute a stored procedure. The procedure then executes on the database server, reducing network latency and traffic.

  • Security: Similar to views, stored procedures can provide an extra layer of security. Users can be granted permission to execute a procedure without having direct permissions on the underlying tables, protecting sensitive data and enforcing business rules.
  • Encapsulation of Business Logic: Complex business rules and workflows can be implemented directly within the database, ensuring consistency across all applications that interact with the data.

Practice & Application

🎯 Practice: Creating a Restricted Employee View

Your company's HR department needs a simplified view of employee data. Create a view named HR_Employee_Overview that includes employee_id, first_name, last_name, department_id, and hire_date for all active employees. This view should explicitly exclude sensitive information like salaries or full email addresses.

Click for Solution

This problem demonstrates how to use a view to simplify complex queries and enhance security by restricting access to a subset of columns and rows from a base table.

-- Assume an existing 'Employees' table with:
-- employee_id, first_name, last_name, salary, department_id, hire_date, email, is_active

CREATE VIEW HR_Employee_Overview AS
SELECT
    employee_id,
    first_name,
    last_name,
    department_id,
    hire_date
FROM
    Employees
WHERE
    is_active = TRUE; -- Assuming a boolean 'is_active' column

-- To use the view:
-- SELECT * FROM HR_Employee_Overview WHERE department_id = 5;

Explanation:

  • The CREATE VIEW statement defines a virtual table based on the provided SELECT query.
  • Only the specified columns (`employee_id`, `first_name`, `last_name`, `department_id`, `hire_date`) are made available through the view, effectively hiding salary and email.
  • The `WHERE is_active = TRUE` clause pre-filters the rows, ensuring only active employees are ever seen through this view.
  • HR staff can now query `HR_Employee_Overview` as if it were a regular table, without needing to write the complex `SELECT` statement or having direct access to sensitive data in the underlying `Employees` table.

🎯 Practice: Customer Spending Summary View

Create a view named Customer_Revenue_Summary that shows each customer's name, their total number of orders, and their total lifetime spending. Only include customers who have placed at least one order.

Click for Solution

This problem demonstrates creating a view that encapsulates more complex logic, including joins and aggregation, making a derived dataset readily available for reporting.

-- Assume Customers (customer_id, customer_name, email)
-- Assume Orders (order_id, customer_id, order_date, total_amount, status)

CREATE VIEW Customer_Revenue_Summary AS
SELECT
    C.customer_name,
    COUNT(O.order_id) AS total_orders,
    SUM(O.total_amount) AS total_spending
FROM
    Customers C
INNER JOIN -- Use INNER JOIN to only include customers who have placed orders
    Orders O ON C.customer_id = O.customer_id
GROUP BY
    C.customer_name;
-- Note: apply sorting when querying the view (e.g., ORDER BY total_spending DESC).
-- Some systems, such as SQL Server, reject ORDER BY inside a view definition unless paired with TOP.

-- To use the view:
-- SELECT customer_name, total_spending FROM Customer_Revenue_Summary WHERE total_orders > 5;

Explanation:

  • The view performs an INNER JOIN between Customers and Orders to link customers to their transactions.
  • GROUP BY C.customer_name aggregates the results for each unique customer.
  • COUNT(O.order_id) and SUM(O.total_amount) calculate the total orders and spending per customer, respectively.
  • This view simplifies access to customer revenue metrics, allowing analysts to query a single, pre-calculated source without needing to re-write the joins and aggregations.

🎯 Practice: Stored Procedure to Update Employee Department

Create a stored procedure named TransferEmployee that takes an employee_id and a new_department_id as input. The procedure should update the specified employee's department. It should also include a simple check to ensure the new_department_id actually exists in the Departments table.

Click for Solution

This problem demonstrates creating a stored procedure with input parameters, DML operations (`UPDATE`), and basic conditional logic for data validation.

-- Assume Employees (employee_id, first_name, department_id)
-- Assume Departments (department_id, department_name)

-- PostgreSQL syntax (syntax varies significantly by RDBMS, e.g., SQL Server uses CREATE PROCEDURE)
CREATE OR REPLACE PROCEDURE TransferEmployee(
    IN p_employee_id INT,
    IN p_new_department_id INT
)
LANGUAGE plpgsql
AS $$
DECLARE
    dept_exists BOOLEAN;
BEGIN
    -- Check if the new department ID exists
    SELECT EXISTS (SELECT 1 FROM Departments WHERE department_id = p_new_department_id)
    INTO dept_exists;

    IF dept_exists THEN
        UPDATE Employees
        SET department_id = p_new_department_id
        WHERE employee_id = p_employee_id;
        RAISE NOTICE 'Employee % transferred to Department % successfully.', p_employee_id, p_new_department_id;
    ELSE
        RAISE EXCEPTION 'Department ID % does not exist.', p_new_department_id;
    END IF;
END;
$$;

-- To execute (PostgreSQL):
-- CALL TransferEmployee(101, 2);   -- Should succeed
-- CALL TransferEmployee(102, 999); -- Should raise an exception/error

Explanation:

  • The procedure TransferEmployee accepts two input parameters: `p_employee_id` and `p_new_department_id`.
  • It first declares a variable `dept_exists` to store the result of a check.
  • An `EXISTS` subquery efficiently verifies if the `p_new_department_id` is valid by looking it up in the `Departments` table.
  • An `IF/ELSE` block then decides whether to proceed with the `UPDATE` (if the department exists) or raise an exception (if it doesn't). This basic error handling improves data integrity.
  • RAISE NOTICE provides feedback on successful execution, while RAISE EXCEPTION stops the procedure and signals an error for invalid input.

🎯 Practice: Stored Procedure to Retrieve Filtered Products

Create a stored procedure named GetProductsByPriceRange that takes two input parameters: min_price (DECIMAL) and max_price (DECIMAL). The procedure should return all `product_id`, `product_name`, and `price` for products whose price falls within the specified range (inclusive).

Click for Solution

This problem demonstrates a stored procedure primarily used for data retrieval with parameters, offering a reusable way to fetch filtered results.

-- Assume Products (product_id, product_name, price, category_id)

-- SQL Server (T-SQL) syntax. A plain SELECT inside a SQL Server procedure is returned
-- to the caller as a result set. In PostgreSQL, a bare SELECT inside a plpgsql procedure
-- does not return rows, so the usual equivalent there is a set-returning FUNCTION
-- (RETURNS TABLE ... RETURN QUERY).
CREATE PROCEDURE GetProductsByPriceRange
    @min_price DECIMAL(10, 2),
    @max_price DECIMAL(10, 2)
AS
BEGIN
    SELECT
        product_id,
        product_name,
        price
    FROM
        Products
    WHERE
        price BETWEEN @min_price AND @max_price
    ORDER BY
        price ASC;
END;

-- To execute (SQL Server):
-- EXEC GetProductsByPriceRange @min_price = 10.00, @max_price = 50.00;

Explanation:

  • The procedure GetProductsByPriceRange accepts `@min_price` and `@max_price` as input parameters.
  • Inside the `BEGIN...END` block, a standard SELECT query is executed.
  • The `WHERE` clause uses the input parameters with the `BETWEEN` operator to filter products within the specified price range (inclusive).
  • `ORDER BY price ASC` ensures the results are returned in a consistent, sorted order.
  • When called with EXEC, the procedure runs this `SELECT` and returns its result set to the caller, centralizing the query logic so applications can request filtered data with a simple call. In PostgreSQL, you would wrap the same query in a function and call it with `SELECT * FROM GetProductsByPriceRange(10.00, 50.00);`.

12. Window Functions (The "Cheat Code")

Window functions are a powerful and often "advanced" SQL feature that allow you to perform calculations across a set of table rows that are related to the current row. What makes them unique is that, unlike aggregate functions (like SUM() with GROUP BY), window functions do not collapse the rows into a single output row; they retain the individual row details while still providing aggregated or ranked insights.

Conceptual Understanding: Think of a "window" as a flexible frame of rows defined by your query. A window function performs a calculation over these rows, but for *each* row, it returns a result based on its defined window. The original number of rows remains unchanged.

Conceptual Understanding

  • Aggregation without collapsing rows:

    With GROUP BY and aggregate functions, you get one row per group. With window functions, you get one output row for each input row, but each output row includes a calculated value based on a "window" of related rows.

    -- Example: Get each employee's salary AND the average salary of their department
    SELECT
        employee_id,
        first_name,
        salary,
        department_id,
        AVG(salary) OVER (PARTITION BY department_id) AS avg_dept_salary -- Window function!
    FROM
        Employees;
    -- This returns every employee's row, with an additional column showing their department's average.
  • The OVER() clause anatomy:

    The magic of window functions lies within the OVER() clause. It defines the "window" or set of rows on which the function operates. It has two main components:

    • PARTITION BY: Divides the result set into partitions (groups) to which the window function is independently applied. Think of it as the "grouping" mechanism for the window.
    • ORDER BY: Orders the rows within each partition. This is crucial for functions that depend on order, like ranking or lead/lag.

Ranking Functions

These functions assign a rank to each row within its partition based on the specified order.

  • ROW_NUMBER() (Unique increment):

    Assigns a unique, sequential integer to each row within its partition, starting from 1. If there are ties in the ordering, ROW_NUMBER() assigns arbitrary but unique ranks.

    SELECT
        employee_id,
        first_name,
        salary,
        department_id,
        ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank_in_dept
    FROM
        Employees;
  • RANK() (Skips numbers on ties):

    Assigns the same rank to rows that have identical values in the ORDER BY clause. If there are ties, the next rank value will "skip" numbers (e.g., if two rows are rank 1, the next unique rank will be 3).

  • DENSE_RANK() (No skipping on ties):

    Similar to RANK(), it assigns the same rank to tied rows. However, DENSE_RANK() does not skip rank numbers after a tie (e.g., if two rows are rank 1, the next unique rank will be 2).
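
    A minimal side-by-side sketch of both (reusing the same hypothetical Employees table as the ROW_NUMBER() example above); only the tie handling differs:

    SELECT
        employee_id,
        first_name,
        salary,
        RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank,
        DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_dense_rank
    FROM
        Employees;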

Table Comparison of Ranking Functions

Let's illustrate ROW_NUMBER(), RANK(), and DENSE_RANK() within a single partition (e.g., for one department, ordered by salary DESC).

Employee | Salary  | ROW_NUMBER() | RANK() | DENSE_RANK()
Alice    | 100,000 | 1            | 1      | 1
Bob      | 90,000  | 2            | 2      | 2
Charlie  | 80,000  | 3            | 3      | 3
David    | 80,000  | 4            | 3      | 3
Eve      | 70,000  | 5            | 5      | 4
Frank    | 70,000  | 6            | 5      | 4
Grace    | 70,000  | 7            | 5      | 4
Hannah   | 60,000  | 8            | 8      | 5
Key Observation:
  • `ROW_NUMBER()` is sequential and unique, even for ties.
  • `RANK()` skips numbers (e.g., after two 80k salaries get rank 3, Eve gets rank 5, skipping 4).
  • `DENSE_RANK()` does not skip numbers (e.g., after two 80k salaries get rank 3, Eve gets rank 4).

Positional Functions

These functions allow you to access data from preceding or succeeding rows within the current row's window.

  • LAG() (Previous row access):

    Retrieves a value from a row that comes before the current row within the partition, based on the `ORDER BY` clause. Useful for calculating differences or seeing previous states.

    -- Get each employee's salary and the salary of the employee hired immediately before them in their department
    SELECT
        employee_id,
        first_name,
        hire_date,
        salary,
        department_id,
        LAG(salary, 1) OVER (PARTITION BY department_id ORDER BY hire_date) AS previous_hire_salary
    FROM
        Employees;
  • LEAD() (Next row access):

    Retrieves a value from a row that comes after the current row within the partition, based on the `ORDER BY` clause. Useful for forecasting or comparing with future states.

    -- Get each employee's salary and the salary of the employee hired immediately after them in their department
    SELECT
        employee_id,
        first_name,
        hire_date,
        salary,
        department_id,
        LEAD(salary, 1, 0) OVER (PARTITION BY department_id ORDER BY hire_date) AS next_hire_salary -- 0 is default if no next row
    FROM
        Employees;

Partitioning & Ordering

The `OVER()` clause components are critical for defining the scope and sequence of your window function calculations.

  • PARTITION BY (Resetting the window):

    Defines the groups of rows over which the window function operates. When the `PARTITION BY` column's value changes, the window (and thus the function's calculation) "resets".

    Example: AVG(salary) OVER (PARTITION BY department_id) calculates a separate average for each unique department_id.

    If `PARTITION BY` is omitted, the entire result set is treated as a single partition (one big window).

  • ORDER BY (Sorting within the window):

    Determines the logical order of rows within each partition. This is especially important for ranking functions (like ROW_NUMBER()) and positional functions (like LAG()/LEAD()) as their results are order-dependent.

    Example: ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) ranks employees within each department from highest to lowest salary.

FRAME Clause (Optional but powerful): Beyond `PARTITION BY` and `ORDER BY`, the `OVER()` clause can also include a FRAME clause (e.g., `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`). This defines a *subset* of rows within the partition on which the function operates, allowing for calculations like running totals or moving averages. A quick taste is sketched below, and the practice problems use it for running totals.
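
As a brief, hedged illustration (reusing the hypothetical Orders table from the practice problems below), an explicit frame is what turns a plain aggregate into a running total or a moving average:

-- Running total and 3-row moving average via explicit frames
SELECT
    customer_id,
    order_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total,
    AVG(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS moving_avg_3_rows
FROM
    Orders;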

Practice & Application

🎯 Practice: Departmental Salary Analytics

For each employee, retrieve their first_name, last_name, salary, and department_id. In addition, display two new columns: the average salary for their respective department and the highest salary within their department. All details for each individual employee should remain in their own row.

Click for Solution

This problem uses window functions to perform aggregation (AVG() and MAX()) over a partition (department_id) without collapsing the individual employee rows. This means each employee's record is displayed, but enriched with departmental summary statistics.

-- Assume Employees table: (employee_id, first_name, last_name, salary, department_id, hire_date)

SELECT
    first_name,
    last_name,
    salary,
    department_id,
    AVG(salary) OVER (PARTITION BY department_id) AS avg_dept_salary,
    MAX(salary) OVER (PARTITION BY department_id) AS max_dept_salary
FROM
    Employees
ORDER BY
    department_id, salary DESC;

Explanation:

  • AVG(salary) OVER (PARTITION BY department_id) calculates the average salary for each distinct `department_id`. This average is then displayed on every row belonging to that department.
  • Similarly, MAX(salary) OVER (PARTITION BY department_id) finds the maximum salary within each `department_id` and attaches it to every row within that department.
  • The crucial aspect is that these calculations happen "over" a window of rows, but the original number of rows (employees) is preserved.

🎯 Practice: Demonstrating Ranking Functions

For employees within a specific department (e.g., department_id = 101), list their first_name, salary, and their rank based on salary in descending order. Show the results from ROW_NUMBER(), RANK(), and DENSE_RANK() side-by-side to observe their differences, especially with tied salaries.

Click for Solution

This problem directly illustrates the behavior of the three main ranking window functions when dealing with ties. We'll filter for a single department to make the output clear, assuming sample data with salary ties.

-- Assume Employees table with sample data for department 101:
-- | employee_id | first_name | salary | department_id |
-- |-------------|------------|--------|---------------|
-- | 1           | Alice      | 100000 | 101           |
-- | 2           | Bob        | 90000  | 101           |
-- | 3           | Charlie    | 80000  | 101           |
-- | 4           | David      | 80000  | 101           |
-- | 5           | Eve        | 70000  | 101           |
-- | 6           | Frank      | 70000  | 101           |

SELECT
    first_name,
    salary,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
    RANK() OVER (ORDER BY salary DESC) AS rank_val,
    DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank_val
FROM
    Employees
WHERE
    department_id = 101
ORDER BY
    salary DESC, first_name;

Explanation:

  • The OVER (ORDER BY salary DESC) clause defines the window for all three functions. Since there's no PARTITION BY, the window covers all rows for `department_id = 101`.
  • ROW_NUMBER() assigns a unique, sequential number to each row, even when salaries are tied.
  • RANK() assigns the same rank to tied salaries and then skips numbers for the subsequent ranks. For example, if two employees rank 3rd, the next rank will be 5th.
  • DENSE_RANK() assigns the same rank to tied salaries but does *not* skip numbers, leading to a continuous sequence of ranks. If two employees rank 3rd, the next rank will be 4th.

This output matches the behavior shown in the "Table Comparison of Ranking Functions" table above.

🎯 Practice: Analyzing Salary Evolution Over Time (LAG)

For each employee, retrieve their first_name, last_name, hire_date, and salary. Additionally, calculate the salary of the employee hired immediately before them within the same department. If an employee is the first hire in their department, show NULL for the previous employee's salary.

Click for Solution

This problem utilizes the LAG() window function to access data from a preceding row within a defined partition, enabling "previous value" comparisons.

-- Assume Employees table: (employee_id, first_name, last_name, salary, department_id, hire_date)

SELECT
    first_name,
    last_name,
    hire_date,
    salary,
    department_id,
    LAG(salary, 1) OVER (PARTITION BY department_id ORDER BY hire_date ASC) AS previous_hire_salary
FROM
    Employees
ORDER BY
    department_id, hire_date ASC;

Explanation:

  • LAG(salary, 1) retrieves the `salary` value from the row one position before the current row within its window.
  • PARTITION BY department_id ensures that the `LAG` function restarts its count and comparison for each new department, preventing an employee from being compared with someone from a different department.
  • ORDER BY hire_date ASC orders the employees within each department by their hire date. This is crucial because `LAG()` (and `LEAD()`) depend on a defined order to determine what "previous" or "next" means.
  • For the first employee hired in any department, `previous_hire_salary` will be NULL, as there is no preceding row in that window.

🎯 Practice: Calculating Running Total of Sales Per Customer

You have an Orders table with customer_id, order_id, order_date, and amount. For each order a customer makes, calculate their cumulative spending up to that specific order date, and include this cumulative total on each order row. Order the results by customer, then by order date.

Click for Solution

This problem uses the SUM() aggregate function as a window function with a specific frame clause, allowing for the calculation of running totals.

-- Assume Orders table: (order_id, customer_id, order_date, amount)

SELECT
    customer_id,
    order_id,
    order_date,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date ASC) AS cumulative_spending
    -- With ORDER BY and no explicit frame, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
    -- so rows sharing the same order_date receive the same cumulative value. Specify
    -- ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW for a strictly row-by-row running total.
FROM
    Orders
ORDER BY
    customer_id, order_date ASC;

Explanation:

  • SUM(amount) is the aggregate function.
  • PARTITION BY customer_id ensures that the cumulative sum is calculated independently for each customer. The sum resets when the customer changes.
  • ORDER BY order_date ASC specifies the order within each customer's partition. This order is essential for a running total, as it dictates what "up to this point" means.
  • When `ORDER BY` is present and no `ROWS` or `RANGE` frame is specified, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The function sums all `amount` values from the beginning of the partition up to and including the current row; rows with the same `order_date` (peers) share the same cumulative value, so use an explicit `ROWS` frame if each row must get its own running total.

13. Common SQL Interview Scenarios & Patterns

SQL interviews often revolve around a set of recurring problems and data manipulation challenges. Mastering these patterns will significantly boost your confidence and performance.

The Strategy: These scenarios test your ability to combine foundational SQL concepts (joins, aggregation, subqueries, window functions) to solve practical data problems.

Duplicate Handling

Dealing with duplicate records is a very common task in data cleaning and analysis.

  • Finding duplicates (`GROUP BY` + `HAVING count > 1`):

    To identify rows that have duplicate values for one or more columns, group by those columns and use HAVING to filter for groups with a count greater than 1.

    -- Find duplicate emails in the Employees table
    SELECT
        email,
        COUNT(email) AS num_duplicates
    FROM
        Employees
    GROUP BY
        email
    HAVING
        COUNT(email) > 1;
    This identifies the *values* that are duplicated. To see the *actual duplicate rows*, you'd typically join this result back to the original table (see the sketch after this list).
  • Deleting duplicates (Using CTEs or Row_Number):

    This is a trickier task as you usually want to keep one instance of the duplicate. Window functions, especially ROW_NUMBER(), are ideal for this.

    -- Delete duplicate rows, keeping one instance (e.g., the one with the lowest employee_id)
    WITH EmployeeDuplicates AS (
        SELECT
            *, -- Select all columns to uniquely identify the row
            ROW_NUMBER() OVER (PARTITION BY email ORDER BY employee_id ASC) AS rn
        FROM
            Employees
    )
    DELETE FROM Employees
    WHERE employee_id IN (
        SELECT employee_id FROM EmployeeDuplicates WHERE rn > 1
    );
    -- Or if your RDBMS supports DELETE FROM CTE directly (e.g., SQL Server):
    -- DELETE FROM EmployeeDuplicates WHERE rn > 1;
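
As noted above, a hedged sketch of pulling back the full rows behind each duplicated value (same hypothetical Employees table), by joining the aggregated duplicates back to the base table:

-- Show the complete rows whose email appears more than once
SELECT
    E.*
FROM
    Employees E
JOIN (
    SELECT email
    FROM Employees
    GROUP BY email
    HAVING COUNT(*) > 1
) D ON E.email = D.email
ORDER BY
    E.email, E.employee_id;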

Top N Analysis

Finding the highest, lowest, or "N-th" value/record is a frequent interview question.

  • Finding N-th highest/lowest value (Subquery vs. DENSE_RANK):

    Window functions provide the most elegant solution for this. DENSE_RANK() is often preferred over RANK() because it handles ties without skipping subsequent ranks, which typically aligns with how "N-th" is interpreted.

    -- Find the 2nd highest salary
    WITH RankedSalaries AS (
        SELECT
            employee_id,
            first_name,
            last_name,
            salary,
            DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
        FROM
            Employees
    )
    SELECT
        first_name,
        last_name,
        salary
    FROM
        RankedSalaries
    WHERE
        salary_rank = 2;

Step-by-step logic for N-th highest value

Let's illustrate how DENSE_RANK() helps find the N-th highest value, using the 2nd highest salary as an example.

1. Start with Data

Employees Table:

ID | Salary
1  | 100k
2  | 90k
3  | 80k
4  | 80k
5  | 70k

2. Apply DENSE_RANK()

OVER (ORDER BY Salary DESC)

ID | Salary | Dense_Rank
1  | 100k   | 1
2  | 90k    | 2
3  | 80k    | 3
4  | 80k    | 3
5  | 70k    | 4

3. Filter by Rank

WHERE Dense_Rank = 2

ID | Salary | Dense_Rank
2  | 90k    | 2
This method is robust against ties. If multiple employees share the 2nd highest salary, all of them would be returned.

Cumulative Metrics

Calculating running totals or moving averages helps analyze trends over time.

  • Running Totals (`SUM() OVER(ORDER BY...)`):

    As seen in the Window Functions section, this is a direct application of the SUM() aggregate as a window function with an implicit (or explicit) frame clause.

    -- Calculate a running total of order amounts by customer over time
    SELECT
        customer_id,
        order_date,
        amount,
        SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total_amount
    FROM
        Orders
    ORDER BY
        customer_id, order_date;
  • Moving Averages:

    Similar to running totals, but you define a "moving window" (frame) to calculate the average over a specified number of preceding/following rows.

    -- Calculate a 3-day moving average of daily sales
    SELECT
        sale_date,
        daily_sales,
        AVG(daily_sales) OVER (ORDER BY sale_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS three_day_moving_avg
    FROM
        DailySales;

Data Transformation

SQL provides many functions to clean, reformat, and derive new values from existing data.

  • String manipulation (`LEFT`, `RIGHT`, `SUBSTRING`, `CONCAT`):

    Essential for cleaning, parsing, and formatting text data.

    SELECT
        full_name,
        LEFT(full_name, 5) AS first_5_chars,
        RIGHT(full_name, 3) AS last_3_chars,
        SUBSTRING(full_name, 7, 4) AS middle_4_chars, -- Start at 7th char, take 4
        CONCAT(first_name, ' ', last_name) AS full_name_concat
    FROM
        Customers;
  • Date manipulation (`DATEDIFF`, `DATEADD`, Extracting Year/Month):

    Crucial for time-series analysis, age calculations, and filtering by date parts.

    -- Dialect-specific examples; the statement below mixes dialects and will not run as-is in any single RDBMS
    SELECT
        order_date,
        EXTRACT(YEAR FROM order_date) AS order_year, -- PostgreSQL/MySQL; SQL Server: YEAR(order_date)
        EXTRACT(MONTH FROM order_date) AS order_month,
        DATEDIFF(day, order_date, GETDATE()) AS days_since_order, -- SQL Server; PostgreSQL: CURRENT_DATE - order_date
        DATEADD(month, 3, order_date) AS three_months_later_date -- SQL Server; PostgreSQL: order_date + INTERVAL '3 months'
    FROM
        Orders;
  • Coalescing (`COALESCE` to handle NULLs):

    COALESCE() returns the first non-NULL expression in a list. It's excellent for providing default values when a column might be NULL.

    -- Display phone_number, but if NULL, display 'N/A'
    SELECT
        employee_id,
        first_name,
        COALESCE(phone_number, 'N/A') AS contact_phone
    FROM
        Employees;

Practice & Application

🎯 Practice: Identifying and Cleaning Duplicate Customer Emails

You have a Customers table with customer_id, first_name, last_name, and email.

  1. First, find all email addresses that appear more than once in the table.
  2. Then, write a query to delete all but one instance of each duplicate customer, keeping the row with the lowest customer_id for each unique email address.

Click for Solution

This problem demonstrates the crucial steps for duplicate handling: first identifying them, and then selectively deleting them while preserving data integrity.

-- Assume Customers table:
-- (customer_id INT PRIMARY KEY, first_name VARCHAR, last_name VARCHAR, email VARCHAR)
-- Sample data could have:
-- (1, 'Alice', 'Smith', 'alice@example.com')
-- (2, 'Bob', 'Jones', 'bob@example.com')
-- (3, 'Charlie', 'Brown', 'alice@example.com') -- Duplicate email for Alice
-- (4, 'Diana', 'Prince', 'diana@example.com')
-- (5, 'Eve', 'Green', 'bob@example.com')     -- Duplicate email for Bob

-- 1. Find all duplicate email addresses
SELECT
    email,
    COUNT(email) AS duplicate_count
FROM
    Customers
GROUP BY
    email
HAVING
    COUNT(email) > 1;

Explanation for Part 1:

  • GROUP BY email groups all rows that share the same email address.
  • HAVING COUNT(email) > 1 filters these groups, showing only those email addresses that appeared more than once.
-- 2. Delete duplicate customers, keeping the one with the lowest customer_id
WITH RankedCustomers AS (
    SELECT
        customer_id,
        email,
        ROW_NUMBER() OVER (PARTITION BY email ORDER BY customer_id ASC) AS rn
    FROM
        Customers
)
DELETE FROM Customers
WHERE customer_id IN (
    SELECT customer_id FROM RankedCustomers WHERE rn > 1
);
-- In SQL Server, you can delete directly through the CTE:
-- WITH RankedCustomers AS (...) DELETE FROM RankedCustomers WHERE rn > 1;

Explanation for Part 2:

  • The RankedCustomers CTE assigns a ROW_NUMBER to each customer, partitioned by `email`. For each unique email, the `ROW_NUMBER` starts at 1. `ORDER BY customer_id ASC` ensures that the lowest `customer_id` for a given email gets `rn=1`.
  • The outer `DELETE` statement then targets all `customer_id`s from the `RankedCustomers` CTE where `rn > 1`. These are the "extra" duplicate records that we want to remove, leaving only the one with `rn=1`.

🎯 Practice: Finding the 3rd Highest Selling Product per Category

You have Products (product_id, product_name, category_id) and OrderItems (order_item_id, product_id, quantity, price_per_unit). Find the product_name, category_id, and total sales revenue for the 3rd highest-selling product in each category. If a category has fewer than three distinct sales ranks, it simply won't appear in the result.

Click for Solution

This problem combines aggregation, joins, and the DENSE_RANK() window function within a CTE to perform a "Top N per Group" analysis.

-- Assume Products and OrderItems tables exist.
-- Sample data:
-- Products: (1, 'Laptop', 101), (2, 'Mouse', 101), (3, 'Keyboard', 101), (4, 'Monitor', 101), (5, 'Desk', 102), (6, 'Chair', 102)
-- OrderItems: (OI1, 1, 5, 1000), (OI2, 1, 2, 1000), (OI3, 2, 10, 20), (OI4, 3, 8, 50), (OI5, 4, 3, 200), (OI6, 5, 2, 150), (OI7, 5, 1, 150), (OI8, 6, 4, 80)

WITH ProductSales AS (
    SELECT
        P.product_id,
        P.product_name,
        P.category_id,
        SUM(OI.quantity * OI.price_per_unit) AS total_revenue
    FROM
        Products P
    JOIN
        OrderItems OI ON P.product_id = OI.product_id
    GROUP BY
        P.product_id, P.product_name, P.category_id
),
RankedProductSales AS (
    SELECT
        product_id,
        product_name,
        category_id,
        total_revenue,
        DENSE_RANK() OVER (PARTITION BY category_id ORDER BY total_revenue DESC) AS rank_in_category
    FROM
        ProductSales
)
SELECT
    product_name,
    category_id,
    total_revenue
FROM
    RankedProductSales
WHERE
    rank_in_category = 3
ORDER BY
    category_id, total_revenue DESC;

Explanation:

  • ProductSales CTE: First, we calculate the total_revenue for each product by joining Products and OrderItems and aggregating.
  • RankedProductSales CTE: We then apply DENSE_RANK() to this aggregated data. PARTITION BY category_id ensures ranking happens independently within each category. ORDER BY total_revenue DESC ranks products by their sales. DENSE_RANK() is chosen so that if multiple products tie for, say, 2nd place, the next distinct rank is 3rd (not 4th, like `RANK()`).
  • Final SELECT: We select the desired columns from RankedProductSales and filter WHERE rank_in_category = 3 to get only the 3rd highest-selling products.

🎯 Practice: Calculating Monthly Growth Percentage

You have a DailySales table with sale_date and total_sales_amount. Calculate the total monthly sales for each month and, for each month, determine the percentage growth or decline compared to the previous month. Assume data exists for multiple consecutive months.

Click for Solution

This problem involves multiple steps: aggregating daily sales to monthly, using `LAG()` to get the previous month's sales, and then calculating the percentage change. This is a common time-series analysis pattern.

-- Assume DailySales table: (sale_date DATE, total_sales_amount DECIMAL)

WITH MonthlySales AS (
    SELECT
        DATE_TRUNC('month', sale_date) AS sales_month, -- PostgreSQL; MySQL: DATE_FORMAT(sale_date, '%Y-%m-01'); SQL Server: DATEFROMPARTS(YEAR(sale_date), MONTH(sale_date), 1)
        SUM(total_sales_amount) AS monthly_total_sales
    FROM
        DailySales
    GROUP BY
        DATE_TRUNC('month', sale_date)
),
LaggedMonthlySales AS (
    SELECT
        sales_month,
        monthly_total_sales,
        LAG(monthly_total_sales, 1) OVER (ORDER BY sales_month ASC) AS previous_month_sales
    FROM
        MonthlySales
)
SELECT
    sales_month,
    monthly_total_sales,
    previous_month_sales,
    CASE
        WHEN previous_month_sales IS NULL THEN NULL -- No previous month to compare
        WHEN previous_month_sales = 0 THEN NULL -- Avoid division by zero if previous month had 0 sales
        ELSE (monthly_total_sales - previous_month_sales) * 100.0 / previous_month_sales
    END AS growth_percentage
FROM
    LaggedMonthlySales
ORDER BY
    sales_month ASC;

Explanation:

  • MonthlySales CTE: Aggregates daily sales into monthly totals. `DATE_TRUNC('month', sale_date)` (or equivalent) truncates the date to the first day of the month, allowing for grouping by month.
  • LaggedMonthlySales CTE: Uses the LAG() window function to fetch the `monthly_total_sales` from the *previous* month. The `ORDER BY sales_month ASC` is crucial for `LAG()` to correctly identify the preceding period.
  • Final SELECT: Calculates the `growth_percentage`. A `CASE` statement is used to handle `NULL` for the first month (no previous month to compare) and to prevent division by zero if a previous month had zero sales.

🎯 Practice: Formatting User Names and Contact Info

You have a Users table with user_id, first_name, last_name, email, and an optional phone_number.

  1. Create a new column called full_name in "LastName, FirstName" format.
  2. Extract the domain from each email address (e.g., "example.com" from "user@example.com").
  3. Display the phone_number, but if it's NULL, show "Contact Email" instead.

Click for Solution

This problem focuses on practical data transformation using string and `NULL` handling functions, commonly needed for reporting or display purposes.

-- Assume Users table: (user_id, first_name, last_name, email, phone_number)

SELECT
    user_id,
    CONCAT(last_name, ', ', first_name) AS full_name, -- 1. Concatenate for full name
    SUBSTRING(email FROM POSITION('@' IN email) + 1) AS email_domain, -- 2. Extract domain (PostgreSQL/MySQL)
    -- For SQL Server: SUBSTRING(email, CHARINDEX('@', email) + 1, LEN(email)) AS email_domain,
    COALESCE(phone_number, 'Contact Email') AS preferred_contact -- 3. Handle NULL phone number
FROM
    Users
ORDER BY
    full_name;

Explanation:

  • CONCAT(last_name, ', ', first_name): Joins the last name, a comma and space, and the first name to create the desired `full_name` format.
  • SUBSTRING(email FROM POSITION('@' IN email) + 1):
    • POSITION('@' IN email) finds the starting position of the '@' symbol.
    • Adding `+ 1` gets the position right after the '@' symbol.
    • SUBSTRING(email FROM [start_position]) extracts the rest of the string from that start position, giving us the domain. (Syntax varies by RDBMS, e.g., SQL Server uses `CHARINDEX` and `LEN`).
  • COALESCE(phone_number, 'Contact Email'): This function returns the first non-`NULL` expression. If `phone_number` is `NULL`, it will return 'Contact Email'; otherwise, it returns the actual `phone_number`.

14. Advanced Topics & ACID

Moving beyond basic querying and data manipulation, understanding transactions, concurrency, and database design principles (normalization) is crucial for building robust, reliable, and performant database systems. These topics are often explored in more advanced SQL interview questions.

The Big Picture: These concepts ensure your data remains correct and accessible, even when multiple users or processes are interacting with it simultaneously.

Transactions & Concurrency

A transaction is a single logical unit of work that accesses and possibly modifies the contents of a database. Transactions are critical for maintaining data integrity, especially in multi-user environments.

ACID Properties: The Pillars of Reliability

The reliability of database transactions is described by the ACID properties:

⚛️ Atomicity (All or Nothing)

A transaction is treated as a single, indivisible unit. Either all of its changes are committed (applied to the database), or none of them are. If any part of the transaction fails, the entire transaction is rolled back, and the database returns to its state before the transaction began.

Example: A bank transfer. Either money is successfully debited from Account A AND credited to Account B, or neither happens.

✅ Consistency (Valid State)

A transaction brings the database from one valid state to another. It ensures that any data written to the database must be valid according to all defined rules (constraints, triggers, cascades). If a transaction attempts to violate these rules, it is rolled back.

Example: An `INSERT` operation must adhere to `NOT NULL`, `UNIQUE`, `FOREIGN KEY` constraints. If not, the transaction fails, preserving consistency.

🔒 Isolation (Transaction Visibility)

Concurrent transactions execute in such a way that the intermediate states of one transaction are not visible to other concurrent transactions. This gives the illusion that transactions are running serially, even if they are executing in parallel.

Example: If User A is updating a record, User B should not see the half-updated (inconsistent) state of that record until User A's transaction is committed.

Isolation Levels: SQL databases offer different isolation levels (Read Uncommitted, Read Committed, Repeatable Read, Serializable) which determine the degree to which transactions are isolated from each other, balancing concurrency with consistency.
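
A quick, hedged sketch of how the level is chosen per session or transaction (exact syntax varies by RDBMS):

-- SQL Server / MySQL: set the level before starting the transaction
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;   -- MySQL: START TRANSACTION;
-- ... statements here run under REPEATABLE READ guarantees ...
COMMIT;

-- PostgreSQL: the level is usually declared as part of BEGIN
-- BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;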

💾 Durability (Saved Permanently)

Once a transaction has been committed, its changes are permanent and will persist even in the event of system failures (power loss, crashes). The database management system ensures that committed data is written to non-volatile storage.

Example: After a successful credit card payment is committed, even if the server immediately crashes, the payment record will still be there when the system recovers.

Diagram of a Transaction Lifecycle (Begin → Commit/Rollback)

BEGIN TRANSACTION  (start a logical unit of work)
        ↓
Execute SQL statements  (DML: INSERT, UPDATE, DELETE)
        ↓
Are all operations successful?
    ├─ Yes → COMMIT  (save changes permanently)
    └─ No (error/failure) → ROLLBACK  (undo all changes)

-- Example of a transaction
BEGIN TRANSACTION; -- Start the transaction

UPDATE Accounts SET balance = balance - 100 WHERE account_id = 'A123';
INSERT INTO Transactions (from_account, to_account, amount) VALUES ('A123', 'B456', 100);
UPDATE Accounts SET balance = balance + 100 WHERE account_id = 'B456';

-- Pseudocode: check that everything went well (e.g., no negative balances),
-- then issue exactly one of the two statements below:
-- IF conditions_met THEN
--     COMMIT;   -- Save changes to the database
-- ELSE
--     ROLLBACK; -- Undo all changes made since BEGIN TRANSACTION
-- END IF;

Normalization (Database Design)

Normalization is a systematic approach to designing relational database schemas to minimize data redundancy and improve data integrity. It involves breaking down a large table into smaller, related tables and defining relationships between them.

  • 1NF (First Normal Form): Atomic values
    • Each column must contain atomic (indivisible) values. No multi-valued attributes or repeating groups.
    • Each row must be unique (have a primary key).
    Violation Example: A 'phone_numbers' column containing "555-1234, 555-5678" for a single person.
  • 2NF (Second Normal Form): No partial dependencies
    • Must be in 1NF.
    • All non-key attributes must be fully functionally dependent on the entire primary key. (This applies only to tables with composite primary keys).
    Violation Example: If `(order_id, product_id)` is a composite PK, and `product_name` depends only on `product_id` (a *part* of the PK), not on `order_id`. This should be in a separate `Products` table.
  • 3NF (Third Normal Form): No transitive dependencies
    • Must be in 2NF.
    • All non-key attributes must be non-transitively dependent on the primary key. This means no non-key attribute should depend on another non-key attribute.
    Violation Example: In an `Employees` table, if `department_name` depends on `department_id`, and `department_id` depends on `employee_id`. Here, `department_name` transitively depends on `employee_id` via `department_id`. `department_name` should be in a separate `Departments` table.
  • Denormalization (When to break rules for performance):

    Sometimes, for specific performance reasons (e.g., complex reporting, reducing joins on read-heavy systems), you might intentionally introduce redundancy into your database design. This is called denormalization.

    • Pros: Faster read queries (fewer joins), simplified queries.
    • Cons: Increased data redundancy, higher risk of data inconsistency, slower write operations (due to updating redundant data).
    Denormalization is a conscious trade-off made after careful performance analysis, not a default design choice.

Practice & Application

🎯 Practice: Simulating an Atomic Bank Transfer (ACID)

You need to transfer $100 from Account A (ID 1) to Account B (ID 2). Write a SQL transaction that ensures either both the debit from A and the credit to B complete successfully, or neither does. If Account A has insufficient funds (less than $100), the entire transfer should be aborted.

Click for Solution

This problem demonstrates the Atomicity and Consistency properties of ACID. The `BEGIN TRANSACTION`, `COMMIT`, and `ROLLBACK` commands are used to ensure the transfer is a single, all-or-nothing operation, and that account balances remain consistent.

-- Assume Accounts table: (account_id INT PRIMARY KEY, balance DECIMAL(10, 2))
-- Initial state: Account 1 has $500, Account 2 has $100.
-- Let's test with Account 1 having $50 for the ROLLBACK case after running the success case.

-- SQL Server (T-SQL) syntax; in PostgreSQL the same logic would typically live in a
-- plpgsql procedure or DO block using IF ... THEN ... END IF and RAISE instead of PRINT.
BEGIN TRANSACTION;

-- Check balance of Account 1
DECLARE @account_A_balance DECIMAL(10, 2);
SELECT @account_A_balance = balance FROM Accounts WHERE account_id = 1;

IF @account_A_balance >= 100
BEGIN
    -- Debit Account 1
    UPDATE Accounts
    SET balance = balance - 100
    WHERE account_id = 1;

    -- Credit Account 2
    UPDATE Accounts
    SET balance = balance + 100
    WHERE account_id = 2;

    -- All operations successful, commit the transaction
    COMMIT;
    PRINT 'Transfer successful!';
END
ELSE
BEGIN
    -- Insufficient funds, roll back the transaction
    ROLLBACK;
    PRINT 'Transfer failed: Insufficient funds in Account 1.';
END

Explanation:

  • BEGIN TRANSACTION marks the start of the atomic unit of work.
  • The `IF` condition checks for sufficient funds.
  • If funds are sufficient, both `UPDATE` statements are executed. If either fails for any reason (e.g., constraint violation, system error), the subsequent `ROLLBACK` (or an implicit rollback due to error) would undo all changes.
  • COMMIT makes all changes permanent, satisfying Durability.
  • If funds are insufficient, `ROLLBACK` is explicitly called, ensuring that no changes are applied to the database, maintaining Atomicity (all or nothing) and Consistency (no account is partially debited/credited).

🎯 Practice: Preventing a Dirty Read (ACID - Isolation)

Describe a scenario where two concurrent transactions could lead to a "Dirty Read" problem. Explain how the default isolation level in many databases prevents this, upholding the Isolation property.

Click for Solution

This example focuses on the conceptual understanding of Isolation and common concurrency issues.

Scenario: Dirty Read

  1. Transaction A starts: A user (e.g., HR) starts updating an employee's salary from $50,000 to $60,000 but has not yet committed the change.
  2. Transaction B starts concurrently: Another user (e.g., Payroll) queries the same employee's salary to calculate their next paycheck.
  3. The "Dirty Read": If Transaction B reads the salary before Transaction A commits, it might see the $60,000 value.
  4. Problem: If Transaction A then encounters an error and rolls back its change (salary reverts to $50,000), Transaction B has made a calculation based on data that never actually existed in the database. This is a "Dirty Read" because Transaction B read uncommitted data.

How Isolation Prevents It (Read Committed)

Most modern relational databases default to an isolation level of Read Committed (or stricter). This level ensures that:

  • A transaction can only read data that has been committed by other transactions.
  • When Transaction A updates the employee's salary to $60,000 but hasn't committed, a lock is typically held on that row (or versioning is used).
  • If Transaction B attempts to read that employee's salary, the database will either wait for Transaction A to commit (or rollback) or will return the *last committed version* of the data (the $50,000 salary), depending on the specific implementation (locking vs. MVCC - Multi-Version Concurrency Control).
  • This ensures that Transaction B never sees the intermediate, uncommitted state of Transaction A, thus preventing the Dirty Read and upholding the Isolation property.
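
A hedged two-session sketch of the scenario above under READ COMMITTED (the employee_id and the salary figures are illustrative):

-- Session A (HR)
BEGIN TRANSACTION;
UPDATE Employees SET salary = 60000 WHERE employee_id = 42;  -- not yet committed

-- Session B (Payroll), running at the same time
SELECT salary FROM Employees WHERE employee_id = 42;
-- Under READ COMMITTED, Session B either waits for A's lock to clear or reads the
-- last committed value (50000); it never sees the uncommitted 60000.

-- Session A hits an error and undoes its work
ROLLBACK;  -- salary remains 50000, and B never acted on data that "never existed"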

🎯 Practice: Normalizing a Table to 1NF

You have a table named Customers with the following structure and sample data:

CREATE TABLE Customers (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR(100),
    customer_address VARCHAR(255),
    phone_numbers VARCHAR(255) -- Stores multiple phone numbers like "555-1234, 555-5678"
);

INSERT INTO Customers VALUES (1, 'Alice', '123 Main St', '555-1234, 555-5678');
INSERT INTO Customers VALUES (2, 'Bob', '456 Oak Ave', '555-9999');

Identify why this table violates First Normal Form (1NF) and propose a normalized 1NF schema with SQL `CREATE TABLE` statements for the corrected structure.

Click for Solution

This problem highlights the concept of atomic values in 1NF, which prohibits multi-valued attributes within a single column.

1NF Violation:

The phone_numbers column violates 1NF because it stores multiple values (a comma-separated list of phone numbers) within a single cell. Each column should contain only atomic (indivisible) values.

Normalized 1NF Schema:

To achieve 1NF, we split the multi-valued `phone_numbers` into a separate, related table, giving each phone number its own row.

-- Original Customers table remains (without phone_numbers column or just with a single main_phone column)
CREATE TABLE Customers (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR(100),
    customer_address VARCHAR(255)
);

-- New table for phone numbers
CREATE TABLE CustomerPhones (
    phone_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- Surrogate PK for phone number entry
    customer_id INT NOT NULL,
    phone_number VARCHAR(20) NOT NULL,
    CONSTRAINT FK_CustomerPhone FOREIGN KEY (customer_id) REFERENCES Customers(customer_id),
    CONSTRAINT UQ_CustomerPhoneNumber UNIQUE (customer_id, phone_number) -- A customer can't have the same phone number twice
);

-- Sample Data after normalization:
-- Customers:
-- (1, 'Alice', '123 Main St')
-- (2, 'Bob', '456 Oak Ave')

-- CustomerPhones:
-- (101, 1, '555-1234')
-- (102, 1, '555-5678')
-- (103, 2, '555-9999')

Explanation:

  • The original `Customers` table no longer stores multiple phone numbers in one field.
  • A new table, `CustomerPhones`, is created. Each row in `CustomerPhones` now represents a single, atomic phone number.
  • `customer_id` acts as a foreign key, linking each phone number back to its respective customer.
  • A composite `UNIQUE` constraint `(customer_id, phone_number)` ensures a customer doesn't have the exact same phone number listed multiple times.

🎯 Practice: Normalizing a Table to 2NF

You have an OrderDetails table with a composite primary key (order_id, product_id):

CREATE TABLE OrderDetails (
    order_id INT,
    product_id INT,
    product_name VARCHAR(100), -- Product name depends only on product_id
    quantity INT,
    price_per_unit DECIMAL(10, 2),
    PRIMARY KEY (order_id, product_id)
);

INSERT INTO OrderDetails VALUES (101, 1, 'Laptop', 1, 1200.00);
INSERT INTO OrderDetails VALUES (101, 2, 'Mouse', 2, 25.00);
INSERT INTO OrderDetails VALUES (102, 1, 'Laptop', 1, 1200.00);

Identify why this table violates Second Normal Form (2NF) and propose a normalized 2NF schema with SQL `CREATE TABLE` statements.

Click for Solution

This problem addresses 2NF, which deals with partial dependencies when a table has a composite primary key.

2NF Violation:

The table is in 1NF (all values are atomic). However, it violates 2NF because the non-key attribute `product_name` is partially dependent on the primary key. `product_name` depends only on `product_id` (a part of the composite primary key), not on the full `(order_id, product_id)` combination.

This leads to redundancy: the product name `Laptop` is repeated for every order that includes `product_id = 1`.

Normalized 2NF Schema:

To achieve 2NF, we decompose the table into two tables: one for order item details and one for product details.

-- Original OrderDetails table modified to store only order-specific item details
CREATE TABLE OrderItems (
    order_id INT,
    product_id INT,
    quantity INT,
    price_per_unit DECIMAL(10, 2), -- This price might vary per order, so keep it here
    PRIMARY KEY (order_id, product_id),
    CONSTRAINT FK_OrderItemOrder FOREIGN KEY (order_id) REFERENCES Orders(order_id), -- Assuming an Orders table exists
    CONSTRAINT FK_OrderItemProduct FOREIGN KEY (product_id) REFERENCES Products(product_id) -- References the new Products table
);

-- New Products table for product-specific information
CREATE TABLE Products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL UNIQUE
);

-- Sample Data after normalization:
-- OrderItems:
-- (101, 1, 1, 1200.00)
-- (101, 2, 2, 25.00)
-- (102, 1, 1, 1200.00)

-- Products:
-- (1, 'Laptop')
-- (2, 'Mouse')

Explanation:

  • The `Products` table now stores product-specific attributes (`product_name`) that depend solely on `product_id`.
  • The `OrderItems` table maintains the composite primary key `(order_id, product_id)` and stores attributes that fully depend on this key (`quantity`, `price_per_unit`).
  • `product_id` in `OrderItems` acts as a foreign key to `Products`, linking the two tables. This eliminates the redundancy of `product_name` in the `OrderItems` table (while `price_per_unit` stays there because it can vary per order).

🎯 Practice: Normalizing a Table to 3NF

You have an Employees table (in 2NF) with the following structure and sample data:

CREATE TABLE Employees (
    employee_id INT PRIMARY KEY,
    employee_name VARCHAR(100),
    department_id INT,
    department_name VARCHAR(100) -- Department name depends on department_id
);

INSERT INTO Employees VALUES (1, 'Alice', 10, 'Sales');
INSERT INTO Employees VALUES (2, 'Bob', 20, 'Marketing');
INSERT INTO Employees VALUES (3, 'Charlie', 10, 'Sales');

Identify why this table violates Third Normal Form (3NF) and propose a normalized 3NF schema with SQL `CREATE TABLE` statements.

Click for Solution

This problem addresses 3NF, which deals with transitive dependencies, where a non-key attribute depends on another non-key attribute.

3NF Violation:

The table is in 2NF. However, it violates 3NF because `department_name` (a non-key attribute) is transitively dependent on `employee_id` (the primary key) through `department_id` (another non-key attribute). In other words, `department_name` depends on `department_id`, which in turn depends on `employee_id`.

This leads to redundancy: `Sales` and `Marketing` department names are repeated for every employee in those departments.

Normalized 3NF Schema:

To achieve 3NF, we extract the `department_id` and `department_name` into a separate `Departments` table.

-- Employees table
CREATE TABLE Employees (
    employee_id INT PRIMARY KEY,
    employee_name VARCHAR(100),
    department_id INT,
    CONSTRAINT FK_EmployeeDepartment FOREIGN KEY (department_id) REFERENCES Departments(department_id)
);

-- New Departments table
CREATE TABLE Departments (
    department_id INT PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL UNIQUE
);

-- Sample Data after normalization:
-- Employees:
-- (1, 'Alice', 10)
-- (2, 'Bob', 20)
-- (3, 'Charlie', 10)

-- Departments:
-- (10, 'Sales')
-- (20, 'Marketing')

Explanation:

  • The `Departments` table now stores department-specific attributes (`department_name`) that depend solely on `department_id`.
  • The `Employees` table retains `employee_id` (PK), `employee_name`, and `department_id`.
  • `department_id` in `Employees` acts as a foreign key to `Departments`, linking employees to their departments.
  • This eliminates the redundancy of department names and ensures that if a department name changes, it only needs to be updated in one place.

🎯 Practice: Denormalization for Reporting Performance

You have a fully normalized database with Customers (customer_id, customer_name), Orders (order_id, customer_id, order_date), and Products (product_id, product_name). For a daily sales report, the business frequently needs to see customer_name, order_date, and product_name for every item sold.

While this can be done with joins, explain a scenario where you might consider denormalizing this data into a dedicated reporting table, and what the trade-offs would be.

Click for Solution

This problem explores the practical reasons for denormalization, a common strategy in data warehousing and OLAP systems where read performance outweighs the strict adherence to normalization rules.

Scenario for Denormalization:

For a highly frequented daily sales report, users constantly query customer names, order dates, and product names. In a fully normalized schema, this requires joining `Customers`, `Orders`, `OrderItems`, and `Products`. For millions of orders, these joins become computationally expensive when repeated many times a day.

A common denormalization strategy here would be to create a "fact table" or a "summary table" in a data warehouse context. For instance, a `SalesFact` table.

Proposed Denormalized Structure (Conceptual `SalesFact` Table):

Instead of linking via foreign keys and performing joins on the fly, the `SalesFact` table could directly include redundant `customer_name` and `product_name` columns:

-- Denormalized table for reporting (created/refreshed daily)
CREATE TABLE SalesFact_Denormalized (
    sales_fact_id INT PRIMARY KEY,
    order_id INT,
    order_date DATE,
    customer_id INT,
    customer_name VARCHAR(100), -- Redundant: Copied from Customers
    product_id INT,
    product_name VARCHAR(100), -- Redundant: Copied from Products
    quantity INT,
    total_item_price DECIMAL(10, 2)
);

-- Data for this table would typically be populated via ETL processes
-- (Extract, Transform, Load) from the normalized OLTP system.
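
A hedged sketch of the refresh such an ETL step might run, assuming an OrderItems table with order_id, product_id, quantity, and price_per_unit (as in the earlier Top-N practice):

-- Rebuild the reporting table from the normalized source (e.g., a nightly job)
TRUNCATE TABLE SalesFact_Denormalized;

INSERT INTO SalesFact_Denormalized
    (sales_fact_id, order_id, order_date, customer_id, customer_name,
     product_id, product_name, quantity, total_item_price)
SELECT
    OI.order_item_id,              -- reusing the order item id as the fact id
    O.order_id,
    O.order_date,
    C.customer_id,
    C.customer_name,               -- redundant copy, by design
    P.product_id,
    P.product_name,                -- redundant copy, by design
    OI.quantity,
    OI.quantity * OI.price_per_unit
FROM
    Orders O
JOIN Customers C ON O.customer_id = C.customer_id
JOIN OrderItems OI ON O.order_id = OI.order_id
JOIN Products P ON OI.product_id = P.product_id;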

Trade-offs:

  • Pros of Denormalization:
    • Faster Read Performance: Queries for the daily sales report would involve reading from a single, wider table, eliminating complex joins and drastically improving query speed.
    • Simpler Queries: Reporting queries become much simpler, requiring fewer joins and less complex logic.
    • Pre-aggregated Data: Can also pre-aggregate common metrics (e.g., daily sales by product, by customer) to serve frequently requested reports even faster.
  • Cons of Denormalization:
    • Increased Data Redundancy: `customer_name` and `product_name` are repeated for every order item, consuming more storage.
    • Higher Risk of Data Inconsistency: If a customer's name changes in the `Customers` table, the `SalesFact_Denormalized` table might become outdated until it's refreshed. More effort is needed to keep redundant data synchronized.
    • Slower Write/Update Performance: If `SalesFact_Denormalized` were directly updated in real-time, any changes to a customer's or product's core data would require updates in multiple places. (Typically, this table is rebuilt/refreshed, not updated row-by-row).
    • Increased ETL Complexity: A separate ETL (Extract, Transform, Load) process is needed to populate and maintain the denormalized table from the normalized operational database.

Conclusion: Denormalization is a strategic decision, often made in analytical or reporting databases (OLAP systems) where read-heavy workloads demand speed, and the operational database (OLTP system) maintains the normalized, consistent source of truth.

15. Exam & Interview Strategy Guide

You've covered a comprehensive range of SQL topics, from fundamentals to advanced concepts. Now, let's equip you with a robust strategy for approaching SQL interview questions and technical exams confidently.

Mastering the Art: It's not just about knowing the syntax; it's about a systematic approach to problem-solving, effective debugging, and meticulous review.

Problem Solving Framework

Adopt a structured approach to tackle any SQL problem. This framework helps break down complex tasks into manageable steps.

1. Understand Inputs/Outputs

What tables, columns, and sample data are you given? What is the exact expected output (columns, aggregation level, filters)? Clarify any ambiguities!

2. Identify the Grain of the Data

What does each row in your *final* output represent? Are you returning individual records, or aggregated summaries (e.g., one row per customer, one row per day)? This dictates `GROUP BY` strategy.

3. Write the Skeleton Query

Start with the fundamental structure. Which tables do you need? What joins? What's the initial `SELECT` list? Don't worry about complexity yet.

SELECT
  -- ... columns ...
FROM
  Table1
JOIN
  Table2 ON ...
WHERE
  -- ... basic filters ...
GROUP BY
  -- ... grouping columns ...
HAVING
  -- ... group filters ...
ORDER BY
  -- ... sorting ...

4. Fill in the Logic

Incrementally add `WHERE` conditions, `CASE` statements, window functions, and more complex logic. Test each step if possible. Use CTEs to break down intricate logic.

Debugging Your Own Code

Even the best developers write bugs. Knowing how to find and fix them efficiently is a valuable skill.

  • 🛠️ Check Intermediate Results with CTEs: If your query is long or complex, use CTEs to isolate different parts of the logic. Select from each CTE individually to verify that it's producing the expected intermediate data. This is far easier than trying to debug a single, monolithic query.
  • 🛠️ Verify Row Counts Before/After Joins: The "fan-out" problem (Section 7) is a common source of errors. Before and after each `JOIN` or `UNION`, check the `COUNT(*)` to ensure your row count changes as expected. Unexpected increases often point to incorrect join conditions or implicit Cartesian products.
  • 🛠️ Simplify, then Rebuild: If a query is not working, remove all non-essential clauses (ORDER BY, complex `WHERE` conditions, extra columns). Get the simplest core query working, then gradually add back complexity, testing at each step.
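
A minimal sketch of the first two habits, assuming the hypothetical Customers/Orders tables used throughout this guide:

-- 1. Isolate a step in a CTE and inspect it directly
WITH OrderTotals AS (
    SELECT customer_id, SUM(total_amount) AS total_spent
    FROM Orders
    GROUP BY customer_id
)
SELECT * FROM OrderTotals LIMIT 10;   -- TOP 10 in SQL Server; eyeball the intermediate rows

-- 2. Compare row counts before and after a join to catch fan-out
SELECT COUNT(*) FROM Customers;       -- baseline
SELECT COUNT(*)
FROM Customers C
JOIN Orders O ON C.customer_id = O.customer_id;  -- did the count grow as expected?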

Review Checklist

Before submitting your solution or declaring it "done," run through these critical checks:

  • Did I handle NULLs? This is a common pitfall. Are `NULL`s correctly filtered (IS NULL/IS NOT NULL), sorted (NULLS FIRST/LAST), or replaced (COALESCE)?
  • Did I handle duplicates? Does the problem require unique records (`DISTINCT`, `GROUP BY`, `UNION`) or are duplicates expected (`UNION ALL`)? Are aggregations (like `COUNT`) impacted by unintended duplicates?
  • Is the syntax dialect-specific? Be aware of differences between T-SQL (SQL Server), PostgreSQL, MySQL, Oracle SQL. Functions like `DATE_TRUNC`, `DATEDIFF`, `CONCAT`, and auto-incrementing primary keys vary. State your assumed dialect if not specified.
  • Is the query efficient? For interviews, mentioning potential optimizations (indexes, avoiding correlated subqueries, using `EXISTS` instead of `IN` for subqueries if `NULL`s are an issue) can impress.
  • Is the code readable? Use aliases, consistent capitalization, and proper indentation. Comments can explain complex logic.
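
On the EXISTS-versus-IN point above, a small hedged sketch of why NOT EXISTS is usually the safer choice when NULLs may be present (hypothetical Customers/Orders tables):

-- NOT IN returns zero rows if the subquery yields even a single NULL customer_id
SELECT *
FROM Customers
WHERE customer_id NOT IN (SELECT customer_id FROM Orders);

-- NOT EXISTS is unaffected by NULLs and returns customers with no orders, as intended
SELECT *
FROM Customers C
WHERE NOT EXISTS (
    SELECT 1 FROM Orders O WHERE O.customer_id = C.customer_id
);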

Resources

Practice is paramount! Here are some recommended platforms for honing your SQL skills:

  • 📚 LeetCode: Excellent for algorithmic SQL problems, covers a wide range of difficulty.
  • 📚 HackerRank: Good for beginner to intermediate SQL challenges.
  • 📚 StrataScratch: Focuses on real-world data science interview questions, often from top tech companies.
  • 📚 SQLZoo: Interactive tutorials and exercises, great for learning basics.
  • 📚 Kaggle Notebooks: Explore public datasets and see how others solve problems with SQL.

Homework / Challenges

🎯 Challenge: High-Value Customer Segmentation

You have access to three tables:

  • Customers: (customer_id (PK), customer_name, registration_date)
  • Orders: (order_id (PK), customer_id (FK), order_date, total_amount)
  • OrderItems: (order_item_id (PK), order_id (FK), product_id (FK), quantity)

Your task is to identify "High-Value Customers". A customer is considered "High-Value" if they meet both of these criteria:

  1. Their total lifetime spending is greater than the overall average total spending per customer.
  2. They have placed a total number of orders greater than the overall average number of orders per customer.

For each High-Value customer, return their customer_name, their total lifetime spending, total number of orders, and the count of distinct products they've purchased.

Solution

Step 1: Calculate Customer Lifetime Value (Total Spending, Total Orders, Distinct Products)

First, we need to get the lifetime metrics for each customer. This involves joining Customers with Orders and OrderItems, then grouping by customer to aggregate their spending, order count, and unique product purchases.

WITH CustomerLifetimeMetrics AS (
    SELECT
        C.customer_id,
        C.customer_name,
        SUM(O.total_amount) AS total_spending,
        COUNT(DISTINCT O.order_id) AS total_orders,
        COUNT(DISTINCT OI.product_id) AS distinct_products_purchased
    FROM
        Customers C
    JOIN
        Orders O ON C.customer_id = O.customer_id
    JOIN
        OrderItems OI ON O.order_id = OI.order_id
    GROUP BY
        C.customer_id, C.customer_name
)
SELECT * FROM CustomerLifetimeMetrics; -- Intermediate result check

Step 2: Calculate Overall Average Spending and Order Count per Customer

Next, we need the company-wide averages for total spending and total orders per customer. These will be scalar values that we can use for comparison.

-- This could be another CTE or derived table
WITH CustomerLifetimeMetrics AS (
    -- ... (Same as Step 1)
),
OverallAverages AS (
    SELECT
        AVG(total_spending) AS avg_customer_spending,
        AVG(total_orders) AS avg_customer_orders
    FROM
        CustomerLifetimeMetrics
)
SELECT * FROM OverallAverages; -- Intermediate result check

Step 3: Combine and Filter for High-Value Customers

Finally, we combine these CTEs. Since OverallAverages produces exactly one row, we can cross join it to CustomerLifetimeMetrics (or reference its values as scalar subqueries) and then apply the "High-Value" criteria in the WHERE clause.

WITH CustomerLifetimeMetrics AS (
    SELECT
        C.customer_id,
        C.customer_name,
        SUM(O.total_amount) AS total_spending,
        COUNT(DISTINCT O.order_id) AS total_orders,
        COUNT(DISTINCT OI.product_id) AS distinct_products_purchased
    FROM
        Customers C
    JOIN
        Orders O ON C.customer_id = O.customer_id
    JOIN
        OrderItems OI ON O.order_id = OI.order_id
    GROUP BY
        C.customer_id, C.customer_name
),
OverallAverages AS (
    SELECT
        AVG(total_spending) AS avg_customer_spending,
        AVG(total_orders) AS avg_customer_orders
    FROM
        CustomerLifetimeMetrics
)
SELECT
    CLM.customer_name,
    CLM.total_spending,
    CLM.total_orders,
    CLM.distinct_products_purchased
FROM
    CustomerLifetimeMetrics CLM
CROSS JOIN
    OverallAverages OA -- OverallAverages has exactly one row, so every customer row gets the averages
WHERE
    CLM.total_spending > OA.avg_customer_spending
    AND CLM.total_orders > OA.avg_customer_orders
ORDER BY
    CLM.total_spending DESC;

Conclusion: This solution effectively segments customers by first computing individual customer metrics, then calculating company-wide averages, and finally filtering based on those averages. The use of CTEs makes the multi-step logic clear and readable.
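
As a side note, the second CTE can be avoided by computing the company-wide averages as window aggregates over the first CTE. A sketch of that variant, reusing the Step 1 CTE (standard window-function syntax, so it should run on most modern RDBMSs):

WITH CustomerLifetimeMetrics AS (
    -- ... (Same as Step 1)
)
SELECT
    customer_name,
    total_spending,
    total_orders,
    distinct_products_purchased
FROM (
    SELECT
        CLM.*,
        AVG(total_spending) OVER () AS avg_customer_spending, -- average over all customers
        AVG(total_orders)   OVER () AS avg_customer_orders
    FROM CustomerLifetimeMetrics CLM
) with_averages
WHERE total_spending > avg_customer_spending
  AND total_orders > avg_customer_orders
ORDER BY total_spending DESC;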

🎯 Challenge: Longest Consecutive Daily Login Streak

You have a table named UserLogins with the following columns: (login_id (PK), user_id (FK), login_timestamp (TIMESTAMP)).

Your goal is to determine the longest consecutive daily login streak for each user. A "daily login" means at least one login occurred on a particular calendar day. A streak is consecutive when the user logs in on successive calendar days with no day skipped; if a user logs in multiple times on the same day, it counts as one day for the streak.

Return user_id and their longest_streak_days.

Solution

This is a classic and challenging problem that requires a combination of date functions, window functions (here `ROW_NUMBER()`; a `LAG()`-based variant is sketched after the solution), and multiple CTEs to isolate distinct login days and then group consecutive days.

Step 1: Identify Unique Daily Logins for Each User

First, we need to ensure each day is counted only once, even if a user logs in multiple times on that day. We'll extract the date part of the `login_timestamp` and get distinct entries.

WITH DistinctDailyLogins AS (
    SELECT
        user_id,
        CAST(login_timestamp AS DATE) AS login_day -- Extract just the date part (syntax varies by RDBMS)
        -- For MySQL: DATE(login_timestamp)
        -- For SQL Server: CAST(login_timestamp AS DATE)
    FROM
        UserLogins
    GROUP BY
        user_id, CAST(login_timestamp AS DATE) -- Group to get distinct login days
)
SELECT * FROM DistinctDailyLogins; -- Intermediate result check

Step 2: Build a Streak Group Identifier (The Gaps-and-Islands Method)

This is the core trick: assign a `ROW_NUMBER()` to each login day per user (ordered by date) and subtract that many days from the `login_day`. For consecutive days, the result (the "streak_identifier") stays constant; any gap in the dates shifts the result and starts a new group. This gives every streak its own group identifier.

WITH DistinctDailyLogins AS (
    -- ... (Same as Step 1)
),
StreakGroups AS (
    SELECT
        user_id,
        login_day,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day) AS rn,
        -- Subtracting rn days from login_day yields the same value for every day in a streak.
        -- Example: 2023-01-05 (rn=1) -> 2023-01-04; 2023-01-06 (rn=2) -> 2023-01-04 (same group);
        --          2023-01-08 (rn=3) -> 2023-01-05 (gap, so a new group starts).
        login_day - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day) * INTERVAL '1 DAY') AS streak_identifier
        -- PostgreSQL syntax shown here.
        -- For SQL Server: DATEADD(day, -1 * ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day), login_day)
        -- For MySQL: DATE_SUB(login_day, INTERVAL ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day) DAY)
    FROM
        DistinctDailyLogins
)
SELECT * FROM StreakGroups; -- Intermediate result check

Step 3: Calculate the Length of Each Streak

Now that we have grouped consecutive login days, we can simply count the number of days within each `streak_group` for each user.

WITH DistinctDailyLogins AS (
    -- ... (Same as Step 1)
),
StreakGroups AS (
    SELECT
        user_id,
        login_day,
        -- Using an expression that results in the same value for consecutive days, but changes when a gap occurs
        -- Example (PostgreSQL-like syntax for date arithmetic):
        login_day - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day) * INTERVAL '1 DAY') AS streak_identifier
        -- For SQL Server: DATEADD(day, -1 * (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day)), login_day) AS streak_identifier
        -- For MySQL: DATE_SUB(login_day, INTERVAL (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day)) DAY) AS streak_identifier
    FROM
        DistinctDailyLogins
),
StreakLengths AS (
    SELECT
        user_id,
        streak_identifier,
        COUNT(*) AS current_streak_length
    FROM
        StreakGroups
    GROUP BY
        user_id, streak_identifier
)
SELECT * FROM StreakLengths; -- Intermediate result check

Step 4: Find the Longest Streak for Each User

Finally, for each user, we find the maximum `current_streak_length` from all their identified streaks.

WITH DistinctDailyLogins AS (
    SELECT
        user_id,
        CAST(login_timestamp AS DATE) AS login_day -- CAST(... AS DATE) works across most RDBMSs; MySQL also offers DATE()
    FROM
        UserLogins
    GROUP BY
        user_id, CAST(login_timestamp AS DATE)
),
StreakGroups AS (
    SELECT
        user_id,
        login_day,
        -- This calculation creates a common "group" for consecutive days.
        -- If a day is consecutive, the difference between its `login_day` and its `row_number` will be constant.
        login_day - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day) * INTERVAL '1 DAY') AS streak_identifier
        -- PostgreSQL-style date arithmetic; use the DATEADD / DATE_SUB equivalents shown in Step 3 for SQL Server / MySQL
    FROM
        DistinctDailyLogins
),
StreakLengths AS (
    SELECT
        user_id,
        streak_identifier,
        COUNT(*) AS current_streak_length
    FROM
        StreakGroups
    GROUP BY
        user_id, streak_identifier
)
SELECT
    user_id,
    MAX(current_streak_length) AS longest_streak_days
FROM
    StreakLengths
GROUP BY
    user_id
ORDER BY
    user_id;

Conclusion: This complex problem showcases the power of chained CTEs and window functions, particularly the "Gaps and Islands" method, to identify and measure consecutive sequences in time-series data. By creating a `streak_identifier` that remains constant for consecutive days and changes for non-consecutive days, we effectively group the streaks and then calculate their lengths.
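
For completeness, here is a LAG()-based variant of the same idea, sketched in PostgreSQL-style syntax: flag each day that does not immediately follow the user's previous login day, turn those flags into streak groups with a running SUM(), and then count days per group exactly as before.

WITH DistinctDailyLogins AS (
    SELECT user_id, CAST(login_timestamp AS DATE) AS login_day
    FROM UserLogins
    GROUP BY user_id, CAST(login_timestamp AS DATE)
),
Flagged AS (
    SELECT
        user_id,
        login_day,
        -- 1 when this day starts a new streak (no previous day, or the previous login day is not yesterday)
        CASE
            WHEN LAG(login_day) OVER (PARTITION BY user_id ORDER BY login_day) = login_day - INTERVAL '1 DAY'
            THEN 0 ELSE 1
        END AS new_streak_flag
    FROM DistinctDailyLogins
),
Streaks AS (
    SELECT
        user_id,
        login_day,
        SUM(new_streak_flag) OVER (PARTITION BY user_id ORDER BY login_day) AS streak_identifier
    FROM Flagged
)
SELECT
    user_id,
    MAX(streak_length) AS longest_streak_days
FROM (
    SELECT user_id, streak_identifier, COUNT(*) AS streak_length
    FROM Streaks
    GROUP BY user_id, streak_identifier
) per_streak
GROUP BY user_id
ORDER BY user_id;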
