1. Introduction to SQL Interviews
Purpose of SQL Interviews
Ever wondered why companies are so keen on your SQL skills? It's more than just knowing a few commands!
There's a constant trade-off in the real world between writing a query quickly and ensuring it performs optimally. Interviewers want to see how you navigate this spectrum.
Consider the progression of SQL proficiency:
Syntax Recall
Can you write basic SELECT, FROM, WHERE?
Logical Problem Solving
Can you translate business questions into query logic using joins, aggregations, subqueries?
Query Optimization
Can you write efficient queries that scale, considering indexes, execution plans, and data volume?
Most interviews target the Logical Problem Solving stage, often hinting at the need for Query Optimization.
Types of SQL Questions Asked
SQL interviews aren't one-size-fits-all. You might encounter a variety of question formats:
📝 Definition-based (Theory)
Questions about core concepts like ACID properties, Normalization Forms, types of keys, or the difference between `DELETE` and `TRUNCATE`.
💻 Query Writing (Whiteboard/Live coding)
The most common type. You'll be given a schema and a business problem, then asked to write a SQL query to solve it.
📐 Database Design (Schema/Modeling)
You might need to design a simple database schema given a set of requirements, identifying tables, columns, and relationships.
🔍 Debugging/Optimization
Presented with an existing (often inefficient or incorrect) query, you'll be asked to fix it, optimize it, or explain its flaws.
Strategies for Answering Effectively
Beyond knowing SQL, how you approach and articulate your solution can make a huge difference.
- ✅ Clarify Requirements: Before writing any code, ask clarifying questions. What are the expected inputs and outputs? Are there any edge cases?
- ✅ State Assumptions: Explicitly mention any assumptions you're making about the data (e.g., "I'm assuming user_ids are unique and non-NULL in the Users table."). This shows thoughtful consideration.
- ✅ Formatting and Readability: Write clean, well-formatted SQL. Use consistent capitalization, proper indentation, and comments where necessary. A readable query is easier to debug and understand (see the sample query after this list).
- ✅ Think Out Loud: Walk the interviewer through your thought process. Explain *why* you're choosing a `LEFT JOIN` over an `INNER JOIN`, or *why* you're using a CTE.
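As an illustration of the formatting point above, here is a small, hypothetical query written the way interviewers like to read it (the Orders schema is assumed purely for the example):

-- Total spend per customer for completed orders, biggest spenders first
SELECT
    customer_id,
    SUM(total_amount) AS total_spend  -- aggregate per customer
FROM
    Orders
WHERE
    status = 'completed'              -- filter rows before grouping
GROUP BY
    customer_id
ORDER BY
    total_spend DESC;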
2. SQL Fundamentals, Prerequisites & Definitions
Relational Model Foundations
At the heart of SQL lies the Relational Model, a way of organizing data into structured tables. Understanding these foundational terms is crucial.
Let's break down the core components:
Anatomy of a Relational Table
| user_id (PK) | username | email | registration_date |
|---|---|---|---|
| 101 | alice_smith | alice@example.com | 2023-01-15 |
| 102 | bob_jones | bob@example.com | 2023-01-18 |
| 103 | charlie_brown | charlie@example.com | 2023-02-01 |
- Database: The overarching container for all your schemas, tables, views, etc. (e.g., "MyCompanyDB").
- Schema: A logical collection of database objects, often used to group related tables or define access permissions. (e.g., "public", "sales_data").
- Table (Relation): The primary structure where data is stored, consisting of rows and columns. The entire structure above is a table.
- Row (Tuple/Record): A single entry or instance of data in a table. In the table above, (101, alice_smith, alice@example.com, 2023-01-15) is one row.
- Column (Attribute/Field): A specific type of data that each row contains. username, email, registration_date are columns.
- Sets: In the relational model, tables are often considered "sets" of rows (tuples). Each row is unique, and the order of rows doesn't inherently matter.
Data Types & Nuances
Choosing the right data type for each column is crucial for data integrity, storage efficiency, and query performance.
- 🔢 Numeric:
  - INT: Whole numbers (e.g., 1, 1000).
  - BIGINT: Larger whole numbers.
  - FLOAT/REAL: Approximate decimal numbers (less precise, good for scientific data).
  - DECIMAL/NUMERIC/MONEY: Exact decimal numbers (crucial for financial data, precise calculations).
- 📝 String:
  - CHAR(n): Fixed-length string. If you store 'hi' in CHAR(10), it still takes 10 bytes (padded with spaces). Faster for fixed-size data.
  - VARCHAR(n): Variable-length string. Stores only what's needed (plus a small overhead). More common.
  - TEXT: Very large variable-length strings, no practical length limit in many systems.
- ⏰ Date/Time:
  - DATE: Stores only the date (YYYY-MM-DD).
  - TIME: Stores only the time (HH:MM:SS).
  - TIMESTAMP: Stores date and time.
  - TIMESTAMPTZ (PostgreSQL) / DATETIMEOFFSET (SQL Server): Stores date and time with timezone awareness. Crucial for applications spanning multiple timezones.
- ✅ Boolean:
  - BOOLEAN (PostgreSQL) / BIT (SQL Server) / TINYINT(1) (MySQL): Stores true/false values.
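To make these trade-offs concrete, here is a minimal sketch of a table definition that touches each category (the table and columns are hypothetical; syntax is PostgreSQL-flavored):

CREATE TABLE PaymentEvents (
    event_id      BIGINT PRIMARY KEY,   -- large whole-number surrogate key
    card_last4    CHAR(4),              -- always exactly 4 characters
    merchant_name VARCHAR(200),         -- variable-length text
    notes         TEXT,                 -- unbounded free text
    amount        DECIMAL(12, 2),       -- exact money value; never FLOAT for money
    occurred_at   TIMESTAMPTZ,          -- timezone-aware timestamp
    is_refund     BOOLEAN DEFAULT FALSE -- true/false flag
);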
The Concept of NULL
NULL is one of the most misunderstood concepts in SQL. It does not mean zero, an empty string, or false.
- Definition: NULL represents missing or unknown data.
- Missing vs. Empty vs. Zero:
  - NULL: Data is unknown or not applicable.
  - '' (empty string): Data is known, but it's an empty text value.
  - 0 (zero): Data is known, and its numerical value is zero.
The presence of NULL introduces "Three-Valued Logic" into SQL: TRUE, FALSE, and UNKNOWN. This has significant implications for conditions in `WHERE` and `HAVING` clauses.
Truth Table: AND/OR Interactions with NULL
| Condition A | Condition B | A AND B | A OR B |
|---|---|---|---|
| TRUE | TRUE | TRUE | TRUE |
| TRUE | FALSE | FALSE | TRUE |
| TRUE | NULL | UNKNOWN | TRUE |
| FALSE | NULL | FALSE | UNKNOWN |
| NULL | NULL | UNKNOWN | UNKNOWN |
| FALSE | FALSE | FALSE | FALSE |
| NULL | TRUE | UNKNOWN | TRUE |
A common trap is writing = NULL or != NULL. These comparisons always result in UNKNOWN. You must use IS NULL or IS NOT NULL to correctly filter for or against NULL values.
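A minimal demonstration of the trap, assuming an Employees table with a nullable phone_number column (the same pattern reappears in the filtering section later):

-- Returns ZERO rows: 'phone_number = NULL' is UNKNOWN for every row
SELECT * FROM Employees WHERE phone_number = NULL;

-- Correct: returns the rows where phone_number is missing
SELECT * FROM Employees WHERE phone_number IS NULL;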
Keys & Relationships
Keys define relationships between tables and enforce data integrity.
🔑 Primary Key (PK)
- Uniquely identifies each row in a table.
- Must contain unique values and cannot be NULL.
- Natural Key: A column that naturally exists and is unique (e.g., `ISBN` for a book).
- Surrogate Key: An artificial key, often an auto-incrementing integer, created specifically to be the PK (e.g., `user_id`). Recommended for flexibility.
🔗 Foreign Key (FK)
- A column (or set of columns) in one table that refers to the Primary Key of another table.
- Establishes and enforces a link between the data in two tables.
- Ensures referential integrity: you can't have an order for a customer that doesn't exist.
🧩 Composite Key
- A Primary Key consisting of two or more columns whose values, when combined, uniquely identify each row in a table.
- Used when no single column can guarantee uniqueness.
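A quick sketch of a composite key, assuming a hypothetical OrderItems table where neither column is unique on its own but the pair is:

CREATE TABLE OrderItems (
    order_id   INT,
    product_id INT,
    quantity   INT NOT NULL,
    PRIMARY KEY (order_id, product_id) -- composite: the combination must be unique
);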
Conceptual ER Diagram: Customers & Orders
Customers Table
customer_id (PK, INT) 🔑
first_name (VARCHAR)
last_name (VARCHAR)
email (VARCHAR, UNIQUE)
Orders Table
order_id (PK, INT) 🔑
customer_id (FK, INT) 🔗
order_date (DATE)
total_amount (DECIMAL)
Here, customer_id in the Orders table is a Foreign Key that links back to the customer_id Primary Key in the Customers table. This ensures every order belongs to an existing customer.
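As a sketch, the diagram above could be realized with DDL like this (auto-increment syntax varies by RDBMS and is omitted for simplicity):

CREATE TABLE Customers (
    customer_id INT PRIMARY KEY,  -- 🔑
    first_name  VARCHAR(50),
    last_name   VARCHAR(50),
    email       VARCHAR(100) UNIQUE
);

CREATE TABLE Orders (
    order_id     INT PRIMARY KEY, -- 🔑
    customer_id  INT REFERENCES Customers(customer_id), -- 🔗 FK enforcing referential integrity
    order_date   DATE,
    total_amount DECIMAL(10, 2)
);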
SQL Command Categories (The Big Picture)
SQL commands are broadly categorized by their purpose:
⚙️ DDL (Data Definition Language)
Commands that define or manage the database structure.
- CREATE (DATABASE, TABLE, INDEX, VIEW)
- ALTER (TABLE, INDEX, VIEW)
- DROP (DATABASE, TABLE, INDEX, VIEW)
- TRUNCATE (TABLE - resets data, keeps structure)
- RENAME
💾 DML (Data Manipulation Language)
Commands that manipulate the data within the database objects.
- SELECT (Retrieve data)
- INSERT (Add new data)
- UPDATE (Modify existing data)
- DELETE (Remove existing data)
🔒 DCL (Data Control Language)
Commands that manage permissions and access control to the database.
- GRANT (Give user permissions)
- REVOKE (Remove user permissions)
🔄 TCL (Transaction Control Language)
Commands that manage transactions in the database.
- COMMIT (Save changes permanently)
- ROLLBACK (Undo changes)
- SAVEPOINT (Set a point within a transaction to roll back to)
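A brief sketch showing DCL and TCL side by side (the user reporting_user and the tables are hypothetical; transaction syntax varies slightly by RDBMS):

-- DCL: grant, then revoke, read-only access
GRANT SELECT ON Employees TO reporting_user;
REVOKE SELECT ON Employees FROM reporting_user;

-- TCL: make a multi-step change reversible
BEGIN TRANSACTION;
UPDATE Employees SET salary = salary * 1.05;
SAVEPOINT after_raise;                        -- a point to roll back to
DELETE FROM Employees WHERE department_id IS NULL;
ROLLBACK TO after_raise;                      -- undo only the DELETE
COMMIT;                                       -- persist the raise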
3. Deep Dive: Order of Execution (Crucial Concept)
Understanding the order in which a SQL query is processed by the database engine is one of the most fundamental and often overlooked concepts. It's the key to writing correct, efficient, and debuggable queries.
Syntactical Order (How you write it)
This is the sequence in which you typically type out your SQL query: SELECT → FROM → WHERE → GROUP BY → HAVING → ORDER BY → LIMIT. It's designed for human readability and clarity.
Logical Order (How the engine runs it)
This is the internal sequence the database follows to process your query: 1. FROM (including JOINs), 2. WHERE, 3. GROUP BY, 4. HAVING, 5. SELECT, 6. DISTINCT, 7. ORDER BY, 8. LIMIT/OFFSET. Each step produces an intermediate result set that is passed to the next step. Understanding this is vital for predicting query behavior.
Aliases defined in SELECT don't work in WHERE
Notice that SELECT (step 5) occurs after WHERE (step 2). This means any column aliases defined in your SELECT list are not yet "known" to the database engine when it processes the WHERE clause. You cannot filter by an alias defined in the same query's SELECT statement within its WHERE clause.
Example of an invalid query (due to logical order):
SELECT
first_name,
last_name,
(salary * 1.10) AS adjusted_salary
FROM
employees
WHERE
adjusted_salary > 60000; -- ERROR: 'adjusted_salary' is unknown here!
To achieve this, you would either repeat the calculation in the WHERE clause, use a subquery, or use a Common Table Expression (CTE); all three approaches appear in the practice below.
Visual Comparison: Written Order vs. Execution Order
📝 Written Order (Human Perspective): SELECT → FROM → WHERE → GROUP BY → HAVING → ORDER BY → LIMIT
🚀 Logical Execution Order (Database Perspective): FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT
Practice & Application
🎯 Practice: Filtering by a Calculated Column (Alias Issue)
You need to find all employees whose estimated annual bonus, calculated as 10% of their salary, is greater than $5,000. Try to use an alias for the bonus calculation in your SELECT statement and then filter by it.
Click for Solution
This problem directly tests your understanding of the SQL query execution order. The WHERE clause is processed before the SELECT clause. Therefore, any alias defined in SELECT is not yet known when the WHERE clause attempts to filter by it, leading to an error.
Incorrect Attempt (will fail):
SELECT
first_name,
last_name,
salary,
(salary * 0.10) AS annual_bonus
FROM
Employees
WHERE
annual_bonus > 5000; -- ERROR: 'annual_bonus' is unknown here!
Correct Solution: To fix this, you must either repeat the calculation in the WHERE clause or use a subquery/CTE where the calculated column is established in an inner query and then filtered in an outer query.
-- Option 1: Repeat the calculation (simplest for single use)
SELECT
first_name,
last_name,
salary,
(salary * 0.10) AS annual_bonus
FROM
Employees
WHERE
(salary * 0.10) > 5000;
-- Option 2: Using a Subquery (more maintainable for complex calculations)
SELECT
first_name,
last_name,
salary,
annual_bonus
FROM
(SELECT
first_name,
last_name,
salary,
(salary * 0.10) AS annual_bonus
FROM
Employees
) AS subquery_with_bonus
WHERE
annual_bonus > 5000;
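A third option, mentioned above, is a CTE; this is a sketch of the same logic that many find the most readable:

-- Option 3: Using a CTE (establish the calculation first, then filter)
WITH employees_with_bonus AS (
    SELECT
        first_name,
        last_name,
        salary,
        (salary * 0.10) AS annual_bonus
    FROM
        Employees
)
SELECT
    first_name,
    last_name,
    salary,
    annual_bonus
FROM
    employees_with_bonus
WHERE
    annual_bonus > 5000;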
🎯 Practice: Filtering Rows vs. Filtering Groups
You need to find the average salary for each department, but only for employees hired after January 1, 2020. Additionally, you only want to see departments where this calculated average salary is greater than $65,000.
Click for Solution
This problem highlights the difference between WHERE and HAVING. WHERE filters individual rows before any grouping or aggregation takes place. HAVING filters entire groups after aggregation.
SELECT
department,
AVG(salary) AS avg_dept_salary
FROM
Employees
WHERE
hire_date > '2020-01-01' -- Filters individual rows BEFORE grouping
GROUP BY
department
HAVING
AVG(salary) > 65000; -- Filters groups AFTER aggregation
Explanation:
- The FROM clause gathers all employee data.
- The WHERE clause removes any employee records hired on or before '2020-01-01'.
- The GROUP BY clause then aggregates the remaining employees by department.
- The HAVING clause finally filters these aggregated groups, keeping only those departments whose average salary (of employees hired after 2020) exceeds $65,000.
- Finally, SELECT projects the results.
🎯 Practice: Ordering by a Derived Column Alias
List all employees, showing their full name (first_name concatenated with last_name) and their salary. Order the results alphabetically by the full name.
Click for Solution
Unlike the WHERE clause, the ORDER BY clause is processed after the SELECT clause. This means any column aliases defined in your SELECT list are fully accessible and usable within the ORDER BY clause.
SELECT
first_name,
last_name,
CONCAT(first_name, ' ', last_name) AS full_name,
salary
FROM
Employees
ORDER BY
full_name ASC; -- 'full_name' alias is recognized here!
Explanation:
- The FROM clause retrieves all employee data.
- The SELECT clause calculates and aliases the full_name.
- The ORDER BY clause then takes this result set and sorts it alphabetically using the newly created full_name column.
🎯 Practice: Retrieving Top N Ordered Results
Find the 3 most recently hired employees. Include their full name, department, salary, and hire date.
Click for Solution
This problem demonstrates the typical sequence of `ORDER BY` followed by `LIMIT` (or `TOP` in SQL Server). The data is first sorted according to the criteria, and then the specified number of rows is taken from the top of the sorted list.
SELECT
CONCAT(first_name, ' ', last_name) AS full_name,
department,
salary,
hire_date
FROM
Employees
ORDER BY
hire_date DESC -- Sort by hire date in descending order (most recent first)
LIMIT 3; -- Get only the top 3 rows from the sorted result set
Explanation:
- The FROM and SELECT clauses prepare the initial set of data and columns.
- The ORDER BY hire_date DESC clause arranges all employees from most recently hired to oldest hired.
- Finally, the LIMIT 3 clause takes only the first three rows from this now-sorted list, giving you the 3 most recently hired employees.
4. Data Definition Language (DDL)
Data Definition Language (DDL) commands are used to define, modify, and manage the structure of database objects like databases, tables, indexes, and views. They deal with the schema rather than the data itself.
Creating Structures
The foundation of any database interaction begins with defining its structure.
CREATE DATABASE
This command is used to create a new SQL database instance.
CREATE DATABASE MyCompanyDB;
CREATE TABLE
This is where you define the blueprint for your data, specifying column names, data types, and constraints.
Defining Constraints Inline vs. Externally:
Constraints (like Primary Keys, Foreign Keys, NOT NULL, UNIQUE, CHECK, DEFAULT) can be defined as part of the column definition (inline) or separately at the table level (external).
-- Create Departments first, since Employees references it via a foreign key
CREATE TABLE Departments (
    department_id INT PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL UNIQUE
);
CREATE TABLE Employees (
    employee_id INT PRIMARY KEY, -- Inline PRIMARY KEY
    first_name VARCHAR(50) NOT NULL, -- Inline NOT NULL
    last_name VARCHAR(50) NOT NULL,
    email VARCHAR(100) UNIQUE, -- Inline UNIQUE
    hire_date DATE DEFAULT CURRENT_DATE, -- Inline DEFAULT
    salary DECIMAL(10, 2) CHECK (salary > 0), -- Inline CHECK
    department_id INT,
    -- External FOREIGN KEY definition
    CONSTRAINT fk_department
        FOREIGN KEY (department_id) REFERENCES Departments(department_id)
);
Why choose one over the other?
- Inline: Simpler for single-column constraints.
- External: Required for multi-column (composite) constraints (e.g., `PRIMARY KEY (col1, col2)`), and often preferred for foreign keys for clarity and explicit naming.
Modifying Structures (`ALTER`)
Over time, your database schema will likely need adjustments. The `ALTER TABLE` command is your tool for this.
Adding vs. Dropping Columns
-- Add a new column to the Employees table
ALTER TABLE Employees
ADD COLUMN phone_number VARCHAR(20);
-- Drop an existing column from the Employees table
ALTER TABLE Employees
DROP COLUMN phone_number;
Changing Data Types
-- Change the data type of the email column (syntax varies by RDBMS)
-- PostgreSQL:
ALTER TABLE Employees
ALTER COLUMN email TYPE VARCHAR(150);
-- SQL Server:
ALTER TABLE Employees
ALTER COLUMN email VARCHAR(150);
-- MySQL:
ALTER TABLE Employees
MODIFY COLUMN email VARCHAR(150);
- Shrinking a column's length (e.g., VARCHAR(100) to VARCHAR(50)) can truncate existing data if values exceed the new limit.
- Changing a data type (e.g., VARCHAR to INT) will fail if existing data cannot be converted (e.g., trying to convert 'abc' to an integer).
- Always back up your database before performing significant `ALTER` operations on production systems.
Destructive Commands (Edge Cases)
These commands permanently remove data or structures. Use with extreme caution!
DROP (Remove structure + data)
DROP removes an entire database object (table, database, index, view). When you drop a table, its structure, indexes, permissions, and all its data are permanently deleted.
DROP TABLE OldDataArchive;
DROP DATABASE TestDatabase;
TRUNCATE (Reset data, keep structure, fast log)
TRUNCATE TABLE removes all rows from a table, effectively emptying it. Unlike `DELETE`, it does so by deallocating the data pages, making it much faster and using fewer transaction log resources, especially for large tables. It maintains the table structure, indexes, and constraints.
TRUNCATE TABLE TempLogTable;
Comparison: DELETE vs. TRUNCATE vs. DROP
Understanding the nuances of these destructive commands is vital for database management and preventing accidental data loss.
| Feature | DELETE | TRUNCATE | DROP |
|---|---|---|---|
| What it removes | Rows (based on WHERE clause) | All rows (data only) | Table definition + all data |
| Is it DML or DDL? | DML | DDL | DDL |
| Can it be rolled back? | Yes (if in a transaction) | No (implicitly commits in many RDBMS) | No (implicitly commits) |
| Speed for large tables | Slower (row-by-row logging) | Faster (deallocates data pages) | Fast |
| Uses WHERE clause? | Yes | No | No |
| Resets AUTO_INCREMENT? | No (unless all rows deleted in some systems) | Yes (in most RDBMS) | Yes (table recreated) |
| Fires Triggers? | Yes (row-level triggers) | No | No |
| Retains structure/indexes? | Yes | Yes | No |
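The rollback row in the table is easy to verify yourself. A sketch, assuming a scratch copy of the Employees table:

BEGIN TRANSACTION;
DELETE FROM Employees;             -- DML: logged row by row
SELECT COUNT(*) FROM Employees;    -- 0 inside the transaction
ROLLBACK;                          -- the delete is undone
SELECT COUNT(*) FROM Employees;    -- original row count is back

Running the same experiment with TRUNCATE TABLE Employees; would, in most RDBMS, commit implicitly and leave nothing to roll back (PostgreSQL is a notable exception that allows transactional TRUNCATE).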
Practice & Application
🎯 Practice: Creating Tables with Mixed Constraints
You need to create two tables for an e-commerce platform: Categories and Products.
The Categories table should have category_id (Primary Key, integer) and category_name (unique, not null, varchar(100)).
The Products table should have:
- product_id (Primary Key, integer, auto-incrementing if possible)
- product_name (not null, unique, varchar(255))
- price (decimal, must be greater than 0)
- stock_quantity (integer, default 0)
- category_id (Foreign Key referencing Categories.category_id)
Click for Solution
This problem demonstrates how to use CREATE TABLE and define various types of constraints, both inline and externally, as discussed in the DDL section.
-- First, create the Categories table as Products will reference it
CREATE TABLE Categories (
category_id INT PRIMARY KEY,
category_name VARCHAR(100) NOT NULL UNIQUE
);
-- Now, create the Products table
CREATE TABLE Products (
product_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- PostgreSQL syntax for auto-increment PK
-- (For MySQL, use INT AUTO_INCREMENT PRIMARY KEY)
-- (For SQL Server, use INT IDENTITY(1,1) PRIMARY KEY)
product_name VARCHAR(255) NOT NULL UNIQUE,
price DECIMAL(10, 2) NOT NULL CHECK (price > 0), -- Inline CHECK constraint
stock_quantity INT DEFAULT 0, -- Inline DEFAULT constraint
category_id INT,
-- External Foreign Key constraint for clarity
CONSTRAINT fk_category
FOREIGN KEY (category_id) REFERENCES Categories(category_id)
);
Explanation:
- PRIMARY KEY is defined inline for category_id and product_id.
- NOT NULL and UNIQUE are also inline for category_name and product_name.
- CHECK (price > 0) ensures that no product can have a non-positive price.
- DEFAULT 0 sets the initial stock quantity if not specified.
- The FOREIGN KEY constraint for category_id in the Products table is defined externally for better readability and to give it a specific name (fk_category), linking it to the category_id in the Categories table.
🎯 Practice: Modifying Table Structures with ALTER TABLE
You have the Products table created previously. Now, a new requirement comes in:
- Add a new column called description of type TEXT (which can be NULL) to the Products table.
- It's realized that product names might be longer than 255 characters, so you need to increase the length of the product_name column to VARCHAR(500).
Click for Solution
This exercise uses ALTER TABLE to modify an existing table's structure by adding a column and changing an existing column's data type, highlighting potential risks.
-- 1. Add the 'description' column
ALTER TABLE Products
ADD COLUMN description TEXT;
-- 2. Increase the length of 'product_name'
-- Syntax varies slightly by RDBMS:
-- PostgreSQL:
ALTER TABLE Products
ALTER COLUMN product_name TYPE VARCHAR(500);
-- SQL Server:
ALTER TABLE Products
ALTER COLUMN product_name VARCHAR(500);
-- MySQL:
ALTER TABLE Products
MODIFY COLUMN product_name VARCHAR(500);
Explanation:
- ALTER TABLE Products ADD COLUMN description TEXT; simply appends a new column. Since NOT NULL isn't specified, it defaults to allowing NULLs.
- Changing data types (like VARCHAR(255) to VARCHAR(500)) is generally safe when increasing length. However, if you were to decrease the length, you would risk truncating existing data if any product names exceeded the new, smaller limit. Always consider existing data when modifying column definitions.
🎯 Practice: Removing Columns and Adding Composite Constraints
Continuing with the Products table:
- The business decides that the stock_quantity column is managed by an external inventory system and is no longer needed in the database. Remove this column.
- To prevent duplicate products within the same category (e.g., two products named 'Milk' in the 'Dairy' category but not across categories), add a composite UNIQUE constraint on product_name and category_id.
Click for Solution
This problem uses ALTER TABLE to remove a column and add a composite constraint, reinforcing the concept of multi-column constraints.
-- 1. Drop the 'stock_quantity' column
ALTER TABLE Products
DROP COLUMN stock_quantity;
-- 2. Add a composite UNIQUE constraint
-- Note: If 'product_name' was already UNIQUE on its own, this would be redundant
-- but demonstrates composite constraint syntax.
ALTER TABLE Products
ADD CONSTRAINT UQ_ProductCategoryName UNIQUE (product_name, category_id);
Explanation:
- DROP COLUMN stock_quantity permanently removes the column and all its data. This is irreversible unless you have a backup.
- ADD CONSTRAINT UQ_ProductCategoryName UNIQUE (product_name, category_id) creates a unique constraint across *both* columns. This means the combination of product_name and category_id must be unique. You could have "Milk" in Category A and "Milk" in Category B, but not two "Milk" products in Category A.
🎯 Practice: Understanding DELETE, TRUNCATE, and DROP
Imagine you have a table called AuditLogs with millions of records. Describe the SQL commands you would use for the following scenarios, and explain why each is appropriate:
- Remove all audit logs specifically older than 6 months, while allowing the operation to be easily undone if a mistake is made within a transaction.
- You need to completely clear all data from the AuditLogs table to start fresh, but you want to keep the table structure, indexes, and constraints. This operation should be as fast as possible for a very large table.
- The AuditLogs feature is being completely decommissioned, and the table, along with all its data, is no longer needed in the database.
Click for Solution
This problem directly tests your understanding of the critical differences between DELETE, TRUNCATE, and DROP, especially in terms of data removal, rollback capability, and performance.
-- Scenario 1: Remove specific old records, with rollback capability.
-- Assuming 'log_date' is the timestamp column.
BEGIN TRANSACTION; -- Start a transaction (syntax might vary: START TRANSACTION;)
DELETE FROM AuditLogs
WHERE log_date < CURRENT_DATE - INTERVAL '6 MONTH'; -- PostgreSQL example (adjust for your RDBMS)
-- For SQL Server: WHERE log_date < DATEADD(month, -6, GETDATE());
-- For MySQL: WHERE log_date < DATE_SUB(CURDATE(), INTERVAL 6 MONTH);
-- If you realize you made a mistake:
-- ROLLBACK;
-- If satisfied:
-- COMMIT;
Explanation for Scenario 1 (DELETE):
- DELETE is a DML command, meaning it operates row by row and is fully logged.
- It allows a WHERE clause to target specific records.
- Crucially, it respects transactions, so if wrapped in a BEGIN TRANSACTION and COMMIT/ROLLBACK, the changes can be undone if an error occurs or the decision needs to be reversed. This makes it ideal for targeted, reversible deletions. However, for millions of rows, it can be slower and consume more transaction log space than TRUNCATE.
-- Scenario 2: Clear all data, keep structure, fastest way.
TRUNCATE TABLE AuditLogs;
Explanation for Scenario 2 (TRUNCATE):
- TRUNCATE TABLE is a DDL command. It quickly removes all rows from a table by deallocating the data pages, making it extremely fast for large tables.
- It keeps the table's structure, indexes, and constraints intact.
- In most RDBMS, TRUNCATE causes an implicit commit, meaning it cannot be rolled back. It also typically resets any auto-incrementing identity columns. This is suitable when you need to completely empty a table for a fresh start and don't need to undo the operation.
-- Scenario 3: Decommission the entire table.
DROP TABLE AuditLogs;
Explanation for Scenario 3 (DROP):
- DROP TABLE is a DDL command. It removes the entire table definition from the database, including all its data, indexes, constraints, and triggers.
- Like TRUNCATE, it implicitly commits and cannot be rolled back.
- Use this when the table and its data are permanently no longer needed.
5. Data Manipulation Language (DML) & Querying
Data Manipulation Language (DML) commands are the core of interacting with your data. These commands allow you to add, modify, delete, and retrieve data stored within your database tables.
Modifying Data
INSERT (Adding new rows)
The INSERT statement adds one or more new rows of data into a table. You can insert a single row or multiple rows in a single command.
-- Single Row Insert (specifying columns)
INSERT INTO Employees (employee_id, first_name, last_name, email, hire_date, salary, department_id)
VALUES (101, 'Alice', 'Smith', 'alice@example.com', '2023-01-15', 75000.00, 1);
-- Single Row Insert (all columns, order must match table definition)
INSERT INTO Employees
VALUES (102, 'Bob', 'Johnson', 'bob@example.com', '2023-02-20', 82000.00, 2, NULL); -- Assuming a 'phone_number' column, which is NULL
-- Bulk Insert (multiple rows)
INSERT INTO Employees (employee_id, first_name, last_name, email, hire_date, salary, department_id)
VALUES
(103, 'Charlie', 'Brown', 'charlie@example.com', '2023-03-01', 68000.00, 1),
(104, 'Diana', 'Prince', 'diana@example.com', '2023-03-10', 95000.00, 3);
-- Inserting from another SELECT query (e.g., from a staging table)
INSERT INTO ProductionTable (col1, col2)
SELECT staging_col1, staging_col2
FROM StagingTable
WHERE status = 'processed';
UPDATE (Modifying existing rows)
The UPDATE statement modifies existing data in a table. It is absolutely critical to use a WHERE clause with UPDATE to specify which rows should be affected.
⚠️ Danger: forgetting the WHERE clause.
If you omit the WHERE clause in an UPDATE statement, all rows in the table will be updated. This is a common and dangerous mistake that can lead to catastrophic data loss or corruption.
-- Update a single employee's salary
UPDATE Employees
SET salary = 80000.00
WHERE employee_id = 101;
-- Update multiple employees in a specific department
UPDATE Employees
SET salary = salary * 1.05 -- Give a 5% raise
WHERE department_id = 1 AND hire_date < '2023-01-01';
-- Update a column to NULL
UPDATE Employees
SET email = NULL
WHERE employee_id = 103;
DELETE (Removing existing rows)
The DELETE statement removes one or more rows from a table. Like UPDATE, the WHERE clause is paramount.
⚠️ Danger: forgetting the WHERE clause.
If you omit the WHERE clause in a DELETE statement, all rows in the table will be deleted. This is also a common mistake with severe consequences.
-- Delete a single employee record
DELETE FROM Employees
WHERE employee_id = 102;
-- Delete all employees from a specific department
DELETE FROM Employees
WHERE department_id = 3;
-- Delete all employees hired before a certain date
DELETE FROM Employees
WHERE hire_date < '2022-01-01';
The SELECT Statement Deep Dive
The SELECT statement is the most frequently used DML command, used for retrieving data from a database. Mastering its various clauses is essential.
Selection Logic (CASE Statements)
CASE statements allow you to apply conditional logic directly within your query, creating new derived columns based on conditions. This is incredibly powerful for data transformation and reporting.
SELECT
employee_id,
first_name,
salary,
CASE
WHEN salary < 60000 THEN 'Junior'
WHEN salary BETWEEN 60000 AND 90000 THEN 'Mid-Level'
WHEN salary > 90000 THEN 'Senior'
ELSE 'Unknown' -- Optional, but good practice for completeness
END AS career_level,
CASE department_id
WHEN 1 THEN 'Marketing'
WHEN 2 THEN 'Sales'
WHEN 3 THEN 'Engineering'
ELSE 'Other'
END AS department_name
FROM
Employees;
Key points:
- CASE statements evaluate conditions sequentially. The first WHEN condition that evaluates to TRUE will have its corresponding THEN value returned.
- If no WHEN conditions are met, the value in the ELSE clause is returned. If ELSE is omitted and no conditions are met, NULL is returned.
- There are two main forms:
  - Searched CASE: CASE WHEN condition1 THEN result1 WHEN condition2 THEN result2 ... END (More flexible, conditions can be anything).
  - Simple CASE: CASE column_name WHEN value1 THEN result1 WHEN value2 THEN result2 ... END (Compares a single expression to various values).
Filtering (WHERE)
The WHERE clause filters rows based on specified conditions before any grouping or aggregation (remember the order of execution!).
- Operator Precedence (`AND` vs. `OR`):
  - AND has higher precedence than OR. This means conditions joined by AND are evaluated first.
  - Use parentheses to explicitly control the order of evaluation and avoid unexpected results.
-- Example: (salary > 70000 AND department_id = 1) OR hire_date > '2023-06-01'
-- Employees earning >70k in Dept 1, OR anyone hired after June 1, 2023.
SELECT *
FROM Employees
WHERE (salary > 70000 AND department_id = 1) OR hire_date > '2023-06-01';
- Pattern Matching (`LIKE` with `%` and `_`):
  - %: Matches any sequence of zero or more characters.
  - _: Matches any single character.
-- Find employees whose first name starts with 'A'
SELECT * FROM Employees WHERE first_name LIKE 'A%';
-- Find employees whose last name has 'son' anywhere in it
SELECT * FROM Employees WHERE last_name LIKE '%son%';
-- Find employees whose email is exactly 5 characters long before '@example.com'
SELECT * FROM Employees WHERE email LIKE '_____@example.com';
- List Filtering (`IN` vs. `NOT IN`):
  - IN: Matches any value in a specified list.
  - NOT IN: Matches any value *not* in a specified list.
-- Find employees in departments 1 or 3
SELECT * FROM Employees WHERE department_id IN (1, 3);
-- Find employees NOT in departments 1, 2, or 3
SELECT * FROM Employees WHERE department_id NOT IN (1, 2, 3);
⚠️ Edge case: if the list used with NOT IN contains a NULL value (e.g., NOT IN (1, 2, NULL)), the condition can never evaluate to TRUE for any row (it is always FALSE or UNKNOWN), effectively returning zero rows. Be very careful with NULLs in NOT IN lists.
- Range Filtering (`BETWEEN` inclusive boundaries):
  - BETWEEN value1 AND value2: Matches values within the specified range, *inclusive* of both value1 and value2.
-- Find employees with salary between 60,000 and 80,000 (inclusive)
SELECT * FROM Employees WHERE salary BETWEEN 60000.00 AND 80000.00;
-- Find employees hired in Q1 2023
SELECT * FROM Employees WHERE hire_date BETWEEN '2023-01-01' AND '2023-03-31';
- Edge Case: Filtering NULLs (`IS NULL` vs. `= NULL`):
  - As discussed in Section 2, direct comparison operators like =, !=, <, > with NULL always result in UNKNOWN.
  - To find rows where a column is NULL, use IS NULL.
  - To find rows where a column is not NULL, use IS NOT NULL.
-- Correctly find employees with no assigned phone number
SELECT * FROM Employees WHERE phone_number IS NULL;
-- Correctly find employees who DO have an assigned phone number
SELECT * FROM Employees WHERE phone_number IS NOT NULL;
-- This will NOT work as expected (returns no rows, because NULL = NULL is UNKNOWN)
-- SELECT * FROM Employees WHERE phone_number = NULL;
Sorting & Limits
Once rows are filtered, you often want to present them in a specific order and possibly retrieve only a subset.
- `ORDER BY` (Multi-level sorting):
- Sorts the result set by one or more columns in ascending (`ASC`, default) or descending (`DESC`) order.
- You can sort by multiple columns, with later columns acting as tie-breakers.
-- Order by last name ascending, then first name ascending
SELECT *
FROM Employees
ORDER BY last_name ASC, first_name ASC;
-- Order by salary descending, then hire date descending (most recent first)
SELECT *
FROM Employees
ORDER BY salary DESC, hire_date DESC;
- Handling NULLs in sort order (`NULLS FIRST`/`LAST`):
  - Many RDBMS (like PostgreSQL, Oracle) allow you to specify how NULL values should be treated in sorting (e.g., appear first or last). SQL Server and MySQL handle NULLs differently by default (often treating them as the lowest values).
-- PostgreSQL example: Employees with NULL phone numbers appear first when sorting by phone number
SELECT employee_id, first_name, phone_number
FROM Employees
ORDER BY phone_number ASC NULLS FIRST;
-- PostgreSQL example: Employees with NULL phone numbers appear last
SELECT employee_id, first_name, phone_number
FROM Employees
ORDER BY phone_number ASC NULLS LAST;
- `LIMIT`/`OFFSET` (Pagination basics):
  - LIMIT n: Restricts the number of rows returned by the query to n.
  - OFFSET m: Skips m rows before beginning to return the rows. Used for pagination.
  - Syntax varies:
    - LIMIT x OFFSET y (PostgreSQL, MySQL, SQLite)
    - TOP x (SQL Server)
    - OFFSET y ROWS FETCH FIRST x ROWS ONLY (SQL Standard, Oracle 12c+)
-- Get the 5 highest-paid employees
SELECT *
FROM Employees
ORDER BY salary DESC
LIMIT 5; -- MySQL/PostgreSQL
-- For SQL Server:
-- SELECT TOP 5 * FROM Employees ORDER BY salary DESC;
-- Get the next 5 highest-paid employees (e.g., for page 2 of results)
SELECT *
FROM Employees
ORDER BY salary DESC
LIMIT 5 OFFSET 5; -- MySQL/PostgreSQL (skips the first 5, gets the next 5)
-- For SQL Server (using OFFSET FETCH):
-- SELECT * FROM Employees ORDER BY salary DESC OFFSET 5 ROWS FETCH NEXT 5 ROWS ONLY;
Practice & Application
🎯 Practice: Managing Employee Data with DML
Perform the following operations on the Employees table:
- Insert a new employee: Employee ID 105, John Doe, john.doe@example.com, hired today, $70,000 salary, Department ID 1, no phone number.
- Update employee 101 (Alice Smith) to give her a 10% raise.
- Delete any employee whose email is NULL (assuming such records exist for some reason).
Click for Solution
This problem demonstrates the core DML commands: INSERT, UPDATE, and DELETE, emphasizing the importance of the WHERE clause.
-- Assume Employees table structure:
-- (employee_id INT, first_name VARCHAR, last_name VARCHAR, email VARCHAR,
-- hire_date DATE, salary DECIMAL, department_id INT, phone_number VARCHAR)
-- 1. Insert a new employee
INSERT INTO Employees (employee_id, first_name, last_name, email, hire_date, salary, department_id, phone_number)
VALUES (105, 'John', 'Doe', 'john.doe@example.com', CURRENT_DATE, 70000.00, 1, NULL);
-- 2. Update employee 101 (Alice Smith) with a 10% raise
UPDATE Employees
SET salary = salary * 1.10
WHERE employee_id = 101;
-- 3. Delete employees with a NULL email
DELETE FROM Employees
WHERE email IS NULL; -- Correctly using IS NULL for filtering NULLs
Explanation:
- INSERT uses CURRENT_DATE (or equivalent like GETDATE()/CURDATE()) to record today's date. Specifying NULL for phone_number explicitly sets it to missing.
- The UPDATE statement correctly uses employee_id = 101 in its WHERE clause to target only Alice Smith. Omitting this would have given *everyone* a 10% raise!
- The DELETE statement correctly uses email IS NULL. Remember, email = NULL would never return TRUE.
🎯 Practice: Categorizing Employees with CASE Statements
Retrieve a list of all employees, their current salary, and a new column named SalaryCategory.
The SalaryCategory should be defined as follows:
- 'Low' if salary is less than $60,000
- 'Medium' if salary is between $60,000 and $90,000 (inclusive)
- 'High' if salary is greater than $90,000
- 'Undetermined' if salary is NULL or somehow doesn't fit (though not expected with current logic)
Click for Solution
This problem utilizes the CASE statement to introduce conditional logic into the SELECT clause, creating a derived attribute based on existing data.
SELECT
employee_id,
first_name,
last_name,
salary,
CASE
WHEN salary IS NULL THEN 'Undetermined' -- Handle NULL first if desired
WHEN salary < 60000 THEN 'Low'
WHEN salary BETWEEN 60000 AND 90000 THEN 'Medium'
WHEN salary > 90000 THEN 'High'
ELSE 'Undetermined' -- Catch-all, though previous WHENs should cover most
END AS SalaryCategory
FROM
Employees
ORDER BY
salary DESC;
Explanation:
- The CASE statement evaluates conditions in order. It's often good practice to handle NULL cases first if they should take precedence.
- BETWEEN 60000 AND 90000 is inclusive, covering salaries at both ends of the range.
- The ELSE clause ensures that every row will have a SalaryCategory, preventing NULLs in the new column unless specifically desired.
🎯 Practice: Complex Filtering with WHERE and Pattern Matching
Find all employees who meet the following criteria:
- Their first name starts with 'A' OR their last name contains 'o'.
- AND they were hired in 2023.
- AND their department ID is either 1 or 3.
Click for Solution
This exercise combines multiple filtering conditions using AND, OR, LIKE, and IN, requiring careful consideration of operator precedence.
SELECT
employee_id,
first_name,
last_name,
hire_date,
department_id
FROM
Employees
WHERE
(first_name LIKE 'A%' OR last_name LIKE '%o%') -- Parentheses are CRUCIAL here due to AND/OR precedence
AND hire_date BETWEEN '2023-01-01' AND '2023-12-31'
AND department_id IN (1, 3)
ORDER BY
hire_date ASC;
Explanation:
- (first_name LIKE 'A%' OR last_name LIKE '%o%'): Parentheses are essential. Without them, the ANDed conditions would bind only to last_name LIKE '%o%' (because AND has higher precedence than OR), so anyone whose first name starts with 'A' would be returned regardless of hire date or department. The parentheses ensure the name condition is treated as a single unit.
- hire_date BETWEEN '2023-01-01' AND '2023-12-31': Efficiently filters for the entire year 2023.
- department_id IN (1, 3): Filters for specific department IDs.
- All three main conditions are linked by AND, meaning all must be true for a row to be returned.
🎯 Practice: Paginated Employee List with NULLs Ordering
Retrieve a paginated list of employees, showing the second page of results, with 5 employees per page. Sort the employees primarily by department_id in ascending order. If two employees are in the same department, sort them by phone_number in descending order, ensuring that employees with no phone number (NULL) appear at the very end of their department group.
Click for Solution
This problem combines multi-level sorting, explicit NULL handling in sorting, and pagination using LIMIT and OFFSET.
SELECT
employee_id,
first_name,
department_id,
phone_number
FROM
Employees
ORDER BY
department_id ASC,
phone_number DESC NULLS LAST -- Sort phone numbers DESC, with NULLs at the end
LIMIT 5 OFFSET 5; -- For Page 2, LIMIT 5 (rows per page), OFFSET 5 (skip first page)
Explanation:
- ORDER BY department_id ASC: The primary sort key, arranging employees by department.
- phone_number DESC NULLS LAST: The secondary sort key. It sorts phone numbers in descending order (e.g., '999-...' before '111-...'), but specifically places any NULL phone numbers at the very end within each department group. This relies on RDBMS support for NULLS LAST (like PostgreSQL/Oracle). For MySQL/SQL Server, you might need a CASE expression in ORDER BY for explicit NULL ordering, as shown below.
- LIMIT 5 OFFSET 5: This is for pagination. LIMIT 5 requests 5 rows, and OFFSET 5 skips the first 5 rows (effectively giving you rows 6-10, which is the second page).
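For RDBMS without NULLS LAST, a portable sketch of the same ordering uses a CASE key that forces NULL phone numbers to sort after everything else:

SELECT
    employee_id,
    first_name,
    department_id,
    phone_number
FROM
    Employees
ORDER BY
    department_id ASC,
    CASE WHEN phone_number IS NULL THEN 1 ELSE 0 END, -- NULLs sort last
    phone_number DESC
LIMIT 5 OFFSET 5; -- MySQL syntax; SQL Server would use OFFSET 5 ROWS FETCH NEXT 5 ROWS ONLY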
6. Aggregation, Grouping & Granularity
While SELECT retrieves individual rows, often you need to summarize data. This is where aggregation and grouping come into play, allowing you to derive insights from collections of rows rather than just individual ones.
Aggregate Functions
Aggregate functions (also known as "group functions") perform a calculation on a set of rows and return a single value.
- COUNT(*) vs. COUNT(column_name) (Handling NULLs):
  - COUNT(*): Counts all rows in the result set, including those with NULL values in any column. It literally counts "how many rows are there?".
  - COUNT(column_name): Counts only the non-NULL values in the specified column_name. It counts "how many values are present in this column?".
  - COUNT(DISTINCT column_name): Counts the number of unique non-NULL values in the specified column.
-- Assume Employees table has some NULL emails and phone numbers.
-- Total number of employees (rows)
SELECT COUNT(*) AS total_employees
FROM Employees; -- e.g., 100
-- Number of employees with a recorded email address
SELECT COUNT(email) AS employees_with_email
FROM Employees; -- e.g., 95 (if 5 emails are NULL)
-- Number of unique department IDs
SELECT COUNT(DISTINCT department_id) AS unique_departments
FROM Employees;
- SUM(), AVG():
  - SUM(column_name): Calculates the total sum of non-NULL values in a numeric column.
  - AVG(column_name): Calculates the average of non-NULL values in a numeric column.
-- Total salary paid across all employees
SELECT SUM(salary) AS total_payroll
FROM Employees;
-- Average salary of all employees
SELECT AVG(salary) AS average_salary
FROM Employees;
- MIN(), MAX() (Works on Strings/Dates too):
  - MIN(column_name): Returns the smallest value in the column (numerically, alphabetically, or chronologically).
  - MAX(column_name): Returns the largest value in the column (numerically, alphabetically, or chronologically).
-- Highest and lowest salaries
SELECT MAX(salary) AS highest_salary, MIN(salary) AS lowest_salary
FROM Employees;
-- Earliest and latest hire dates
SELECT MIN(hire_date) AS earliest_hire, MAX(hire_date) AS latest_hire
FROM Employees;
-- Alphabetically first and last last names
SELECT MIN(last_name) AS first_last_name, MAX(last_name) AS last_last_name
FROM Employees;
Grouping Mechanics (`GROUP BY`)
The GROUP BY clause divides the rows returned by the FROM and WHERE clauses into groups, and then an aggregate function is applied to each group. This allows you to get summaries *per category*.
- The rule of "Non-aggregated columns must be grouped":
  Any column that appears in the SELECT list that is *not* an aggregate function (like COUNT(), SUM(), etc.) must also appear in the GROUP BY clause. This ensures that the database knows how to categorize and summarize the data.

-- Valid: Calculate average salary PER department
SELECT
    department_id, -- This is not aggregated
    AVG(salary) AS avg_salary_by_dept
FROM Employees
GROUP BY department_id; -- So it MUST be in GROUP BY

-- Invalid (will cause an error): Trying to select first_name without grouping it
-- SELECT department_id, first_name, AVG(salary) FROM Employees GROUP BY department_id;
-- Error: column "Employees.first_name" must appear in the GROUP BY clause or be used in an aggregate function

- Grouping by expressions or derived values:
  You can group by any column, or even an expression or derived value (like a CASE statement or a part of a date).

-- Group by hire year (PostgreSQL/MySQL example)
SELECT
    EXTRACT(YEAR FROM hire_date) AS hire_year, -- Extract year
    COUNT(*) AS total_hires
FROM Employees
GROUP BY EXTRACT(YEAR FROM hire_date); -- Group by the extracted year

-- Group by a CASE statement's output (e.g., career_level from previous section)
SELECT
    CASE
        WHEN salary < 60000 THEN 'Low'
        WHEN salary BETWEEN 60000 AND 90000 THEN 'Medium'
        WHEN salary > 90000 THEN 'High'
        ELSE 'Undetermined'
    END AS SalaryCategory,
    COUNT(*) AS num_employees
FROM Employees
GROUP BY SalaryCategory; -- Grouping by the alias is often supported as well; otherwise repeat the full CASE expression
Filtering Groups (`HAVING`)
The HAVING clause is used to filter groups based on conditions applied to aggregate functions. It operates after GROUP BY.
- WHERE vs. HAVING (Pre-aggregation vs. Post-aggregation):
  - WHERE: Filters individual rows before GROUP BY is applied. It cannot use aggregate functions.
  - HAVING: Filters entire groups after GROUP BY is applied and aggregate functions have been calculated. Its conditions use aggregate functions (or columns from the GROUP BY clause).
-- Find departments where the average salary is above 70,000,
-- considering only employees hired after 2022.
SELECT
department_id,
AVG(salary) AS avg_dept_salary,
COUNT(*) AS num_employees_in_group
FROM
Employees
WHERE
hire_date > '2022-12-31' -- Filters individual employees BEFORE grouping
GROUP BY
department_id
HAVING
AVG(salary) > 70000; -- Filters groups AFTER aggregation
Funnel Chart: Data Reduction from FROM → WHERE → GROUP BY → HAVING
This visual demonstrates how each clause reduces the data set, narrowing down the focus from all raw data to the specific insights you need.
The query first gathers all raw rows (FROM), then discards individual rows not meeting criteria (WHERE), then groups the remaining rows (GROUP BY), and finally discards entire groups not meeting aggregate criteria (HAVING).
7. Joins & Relational Mechanics
Databases are powerful because they allow you to store related pieces of information in separate, organized tables. Joins are the cornerstone of the relational model, enabling you to combine data from two or more tables based on related columns.
Join Theory
A great way to conceptualize different join types is through Venn Diagrams, which illustrate how rows from two sets (tables) are combined.
Venn Diagrams: Interpreting Join Types
INNER JOIN (Intersection)
Returns only the rows that have matching values in both tables.
A ∩ B
LEFT JOIN (All Left + Matching Right)
Returns all rows from the left table, and the matching rows from the right table. If there's no match on the right, NULLs appear for the right table's columns.
A + (A ∩ B)
RIGHT JOIN (All Right + Matching Left)
Returns all rows from the right table, and the matching rows from the left table. If there's no match on the left, NULLs appear for the left table's columns.
B + (A ∩ B)
FULL OUTER JOIN (Union of Both)
Returns all rows when there is a match in one of the tables. If no match, NULLs appear for the non-matching side. It's the union of LEFT JOIN and RIGHT JOIN.
A ∪ B
Join Types & Behaviors
Let's look at the practical syntax and typical use cases.
-- Example Tables:
-- Customers (customer_id, customer_name)
-- Orders (order_id, customer_id, order_date, total_amount)
- INNER JOIN (Intersection):
  Combines rows from two tables where the join condition is true. Rows that do not have a match in both tables are excluded.

SELECT C.customer_name, O.order_id, O.total_amount
FROM Customers C
INNER JOIN Orders O ON C.customer_id = O.customer_id;
-- Shows only customers who have placed orders, and their orders.

- LEFT JOIN (Preserving left table, NULLs on right):
  Returns all rows from the left table (Customers), and the matching rows from the right table (Orders). If there is no matching order for a customer, the order-related columns will be NULL.

SELECT C.customer_name, O.order_id, O.total_amount
FROM Customers C
LEFT JOIN Orders O ON C.customer_id = O.customer_id;
-- Shows ALL customers, including those with no orders (order_id, total_amount will be NULL for them).

- RIGHT JOIN (Preserving right table):
  Returns all rows from the right table (Orders), and the matching rows from the left table (Customers). If an order has no matching customer (e.g., bad data), the customer-related columns will be NULL.

SELECT C.customer_name, O.order_id, O.total_amount
FROM Customers C
RIGHT JOIN Orders O ON C.customer_id = O.customer_id;
-- Shows ALL orders, even if they don't have a matching customer (customer_name will be NULL).
-- Less common, as LEFT JOIN can often achieve the same by swapping tables.

- FULL OUTER JOIN (Union of both):
  Returns all rows when there is a match in either the left or right table. If a row in one table doesn't have a match in the other, the non-matching side will have NULLs.

SELECT C.customer_name, O.order_id, O.total_amount
FROM Customers C
FULL OUTER JOIN Orders O ON C.customer_id = O.customer_id;
-- Shows all customers (with or without orders) AND all orders (with or without customers).
-- Where there's no match, the respective columns will be NULL.

- CROSS JOIN (Cartesian Product - Warning signs):
  Combines every row from the first table with every row from the second table. If table A has $N$ rows and table B has $M$ rows, the result will have $N \times M$ rows. It does not require an ON clause.

SELECT C.customer_name, O.order_id
FROM Customers C
CROSS JOIN Orders O;
-- Produces every possible combination of a customer and an order.
-- Use with extreme caution; often signals an error if not intended (e.g., generating all possible pairings).

- SELF JOIN (Hierarchical data):
  Joining a table to itself. This is useful for comparing rows within the same table, often for hierarchical data (e.g., finding employees and their managers in an Employees table).

-- Assume Employees table has: employee_id, employee_name, manager_id
SELECT E.employee_name AS Employee, M.employee_name AS Manager
FROM Employees E
LEFT JOIN Employees M ON E.manager_id = M.employee_id;
-- Links each employee to their manager (or NULL if no manager).
Join Syntax & Best Practices
- ON (General conditions) vs. USING (Named columns):
  - ON: The most flexible way to specify join conditions. It allows any valid condition (equality, inequality, comparison operators, functions, etc.).
    FROM Customers C JOIN Orders O ON C.customer_id = O.customer_id AND O.order_date > '2023-01-01'
  - USING: A shorthand for joins on columns that have the same name in both tables.
    FROM Customers JOIN Orders USING (customer_id)
  - USING is syntactically cleaner when applicable, but less flexible than ON. The column specified in USING appears only once in the result set, whereas with ON the join columns from both tables are retained.
- Aliasing tables for readability:
  Assigning short aliases (e.g., C for Customers, O for Orders) to tables makes queries shorter and much more readable, especially with multiple joins or self-joins.

SELECT C.customer_name, O.order_date -- Use aliases to specify which table a column comes from
FROM Customers AS C -- 'AS' is optional but good practice
INNER JOIN Orders AS O ON C.customer_id = O.customer_id;
Common Pitfalls
- The "Fan-out" problem (Duplicate rows after join):
This occurs when a row in one table matches multiple rows in another table, leading to an explosion of duplicate (or near-duplicate) rows. This is especially problematic before aggregation, as it can inflate counts or sums.
Example: Customer with Multiple Phone NumbersImagine a
Customerstable and aCustomerPhonestable (one-to-many relationship). If a customer has two phone numbers and you join these tables, that customer's details will appear twice for each phone number. If you then try toCOUNTcustomers, you'll get an inflated number.Customers (C) CustomerPhones (CP) Result of SELECT * FROM C JOIN CP ON C.cust_id = CP.cust_idcust_id: 1, name: Alicephone_id: 101, cust_id: 1, number: 555-1234cust_id: 1, name: Alice, phone_id: 101, number: 555-1234phone_id: 102, cust_id: 1, number: 555-5678cust_id: 1, name: Alice, phone_id: 102, number: 555-5678cust_id: 2, name: Bobphone_id: 201, cust_id: 2, number: 555-9876cust_id: 2, name: Bob, phone_id: 201, number: 555-9876If you then do
COUNT(DISTINCT C.cust_id), it will still be correct (2 customers). But if you didCOUNT(*)on the joined result, you'd get 3, which is wrong if you're trying to count *customers*.Solution: Be mindful of relationship cardinality. Use
DISTINCTwith aggregate functions where appropriate, or aggregate in subqueries/CTEs *before* joining. - Joining on
NULLvalues:Remember from Section 2 that
NULL = NULLevaluates toUNKNOWN. This means that rows where the join column isNULLin both tables will not match using a standard equality join condition (ON T1.col = T2.col).-- If a customer's customer_id is NULL in Customers, and an order's customer_id is NULL in Orders, -- they will NOT join using: ON C.customer_id = O.customer_id -- To explicitly join on NULLs (rare but possible), you'd need: -- ON C.customer_id = O.customer_id OR (C.customer_id IS NULL AND O.customer_id IS NULL)
Practice & Application
🎯 Practice: Customers and Their Orders (INNER JOIN)
You have two tables: Customers (customer_id, customer_name) and Orders (order_id, customer_id, total_amount). Retrieve a list of all customers who have placed at least one order, along with the details of those orders (order ID and total amount). Only include customers who actually have orders.
Click for Solution
This is a classic use case for an INNER JOIN. It returns only the rows where there's a match in both the Customers and Orders tables based on their common customer_id.
-- Assume Customers and Orders tables exist with sample data:
-- Customers: { (1, 'Alice'), (2, 'Bob'), (3, 'Charlie') }
-- Orders: { (101, 1, 50.00), (102, 2, 75.00), (103, 1, 120.00) }
SELECT
C.customer_name,
O.order_id,
O.total_amount
FROM
Customers C
INNER JOIN
Orders O ON C.customer_id = O.customer_id;
Expected Output Logic:
| customer_name | order_id | total_amount |
|---|---|---|
| Alice | 101 | 50.00 |
| Alice | 103 | 120.00 |
| Bob | 102 | 75.00 |
Explanation:
- INNER JOIN ensures that only customers with corresponding entries in the Orders table (and vice-versa) are included in the result.
- Customer 'Charlie' (ID 3) does not appear because there are no orders with customer_id = 3.
🎯 Practice: All Customers and Their Orders (LEFT JOIN)
Using the same Customers and Orders tables, retrieve a list of all customers. For customers who have placed orders, include their order details. For customers who haven't placed any orders, still list them, but show NULLs for order-related fields.
Click for Solution
This scenario requires a LEFT JOIN. It prioritizes the left table (Customers) by returning all its rows, and only matching rows from the right table (Orders). Non-matching right-side columns will show NULL.
-- Assume Customers: { (1, 'Alice'), (2, 'Bob'), (3, 'Charlie') }
-- Orders: { (101, 1, 50.00), (102, 2, 75.00), (103, 1, 120.00) }
SELECT
C.customer_name,
O.order_id,
O.total_amount
FROM
Customers C
LEFT JOIN
Orders O ON C.customer_id = O.customer_id;
Expected Output Logic:
| customer_name | order_id | total_amount |
|---|---|---|
| Alice | 101 | 50.00 |
| Alice | 103 | 120.00 |
| Bob | 102 | 75.00 |
| Charlie | NULL | NULL |
Explanation:
- LEFT JOIN ensures that all rows from the Customers table are present in the result.
- Customer 'Charlie' appears, even without orders, because it exists in the left table. order_id and total_amount are NULL for Charlie.
🎯 Practice: All Orders and Their Customers (RIGHT JOIN)
Retrieve a list of all orders. For orders that have a matching customer, display the customer's name. If an order's customer_id does not exist in the Customers table (e.g., due to bad data or a deleted customer), still list the order, but show NULL for the customer's name.
Click for Solution
This is a perfect scenario for a RIGHT JOIN. It preserves all rows from the right table (Orders) and matches them with the left table (Customers) where possible.
-- Assume Customers: { (1, 'Alice'), (2, 'Bob'), (3, 'Charlie') }
-- Orders: { (101, 1, 50.00), (102, 2, 75.00), (103, 1, 120.00), (104, 999, 200.00) } -- Order 104 has no matching customer
SELECT
C.customer_name,
O.order_id,
O.total_amount
FROM
Customers C
RIGHT JOIN
Orders O ON C.customer_id = O.customer_id;
Expected Output Logic:
| customer_name | order_id | total_amount |
|---|---|---|
| Alice | 101 | 50.00 |
| Bob | 102 | 75.00 |
| Alice | 103 | 120.00 |
| NULL | 104 | 200.00 |
Explanation:
- RIGHT JOIN ensures that all rows from the Orders table are present.
- Order '104' appears, even though customer_id = 999 doesn't exist in Customers. The customer_name is NULL for this row.
- Note that you could achieve the same result with a LEFT JOIN by simply swapping the tables: FROM Orders O LEFT JOIN Customers C ON O.customer_id = C.customer_id;
🎯 Practice: Complete Customer-Order Relationship (FULL OUTER JOIN)
Retrieve a comprehensive list that includes all customers AND all orders. Customers without orders should appear with NULL order details. Orders without matching customers should appear with NULL customer details. Customers and orders with matches should show all their respective details.
Click for Solution
A FULL OUTER JOIN (often just FULL JOIN) achieves this. It returns all rows from the left table and all rows from the right table: matched rows are combined into one, and NULLs fill in wherever no match exists on either side.
-- Assume Customers: { (1, 'Alice'), (2, 'Bob'), (3, 'Charlie') }
-- Orders: { (101, 1, 50.00), (102, 2, 75.00), (103, 1, 120.00), (104, 999, 200.00) }
SELECT
C.customer_name,
O.order_id,
O.total_amount
FROM
Customers C
FULL OUTER JOIN -- Or FULL JOIN in many systems
Orders O ON C.customer_id = O.customer_id;
Expected Output Logic:
| customer_name | order_id | total_amount |
|---|---|---|
| Alice | 101 | 50.00 |
| Alice | 103 | 120.00 |
| Bob | 102 | 75.00 |
| Charlie | NULL | NULL |
| NULL | 104 | 200.00 |
Explanation:
- This result set contains all matches, plus the non-matching rows from the `Customers` table (Charlie) and the non-matching rows from the `Orders` table (Order 104).
🎯 Practice: Finding Employees and Their Managers (SELF JOIN)
You have an Employees table with employee_id, employee_name, and manager_id (which refers to another employee_id in the same table). Retrieve a list showing each employee and the name of their manager. Employees without a manager should still be listed, showing NULL for the manager's name.
Click for Solution
A SELF JOIN is used here to join a table to itself. We treat the table as two logical instances: one for employees and one for managers. A LEFT JOIN ensures all employees are listed, even if they don't have a manager.
-- Assume Employees table:
-- { (1, 'Alice', NULL), (2, 'Bob', 1), (3, 'Charlie', 1), (4, 'David', 2) }
-- Alice (ID 1) has no manager. Bob and Charlie report to Alice. David reports to Bob.
SELECT
E.employee_name AS EmployeeName,
M.employee_name AS ManagerName
FROM
Employees E -- Alias the table as 'E' for Employee
LEFT JOIN
Employees M ON E.manager_id = M.employee_id; -- Alias as 'M' for Manager
Expected Output Logic:
| EmployeeName | ManagerName |
|---|---|
| Alice | NULL |
| Bob | Alice |
| Charlie | Alice |
| David | Bob |
Explanation:
- By aliasing the `Employees` table as `E` and `M`, we can refer to it as if it were two separate tables.
- The join condition `E.manager_id = M.employee_id` links an employee's `manager_id` to their manager's `employee_id`.
- The `LEFT JOIN` ensures that 'Alice', who has no manager (`manager_id` is `NULL`), is still included in the result with `ManagerName` as `NULL`.
🎯 Practice: Mitigating the Fan-Out Problem with Aggregation
You have Customers (customer_id, customer_name), Orders (order_id, customer_id), and a new table Addresses (address_id, customer_id, street). A customer can have multiple addresses.
Your goal is to show each customer's address information alongside that customer's correct total number of orders, without the count being inflated for customers who have multiple addresses.
Click for Solution
This problem demonstrates the "fan-out" issue and how to resolve it by aggregating data *before* joining to tables that might cause row duplication. If we joined Customers to Addresses first, a customer with 3 addresses would appear 3 times, incorrectly inflating order counts if we then joined to Orders.
-- Assume tables:
-- Customers: { (1, 'Alice'), (2, 'Bob') }
-- Orders: { (101, 1), (102, 1), (103, 2) }
-- Addresses: { (A1, 1, '123 Main St'), (A2, 1, '456 Oak Ave'), (A3, 2, '789 Pine Rd') }
-- Incorrect approach (Fan-out problem leading to inflated order counts):
-- If you first join Customers and Addresses, Alice would appear twice.
-- Then joining Orders would make Alice's orders count twice for each address.
/*
SELECT
C.customer_name,
A.street,
COUNT(O.order_id) AS total_orders -- This would be inflated for Alice
FROM
Customers C
LEFT JOIN Addresses A ON C.customer_id = A.customer_id
LEFT JOIN Orders O ON C.customer_id = O.customer_id
GROUP BY
C.customer_id, C.customer_name, A.street;
*/
-- Correct Solution: Aggregate orders first, then join.
WITH CustomerOrderSummary AS (
SELECT
customer_id,
COUNT(order_id) AS num_orders
FROM
Orders
GROUP BY
customer_id
)
SELECT
C.customer_name,
A.street,
COS.num_orders -- Correct count per customer
FROM
Customers C
LEFT JOIN
Addresses A ON C.customer_id = A.customer_id
LEFT JOIN
CustomerOrderSummary COS ON C.customer_id = COS.customer_id
ORDER BY
C.customer_name, A.street;
Explanation:
- `CustomerOrderSummary` CTE: This subquery (or CTE) first calculates the `num_orders` for each `customer_id` directly from the `Orders` table. This aggregation happens *before* any potential fan-out.
- Main Query: We then join `Customers` with `Addresses` (which can still duplicate customer details if they have multiple addresses, but that is *expected* if you want to see each address). Crucially, we then join this result with our pre-aggregated `CustomerOrderSummary`.
- This way, even if a customer has multiple addresses, the `num_orders` displayed will be the correct total for that customer, not an inflated value due to the addresses join. If you then wanted to count distinct customers, you'd use `COUNT(DISTINCT C.customer_id)` on the final result, or aggregate that separately before joining (see the sketch below).
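As a hedged sketch of that last point (same three sample tables assumed): overall totals can be computed safely even over a fanned-out join, by counting distinct keys or by aggregating each table separately.
-- Overall totals despite the Customers-to-Addresses fan-out:
SELECT
    COUNT(DISTINCT C.customer_id) AS distinct_customers, -- de-duplicates fanned-out rows
    (SELECT COUNT(*) FROM Orders) AS total_orders        -- counted separately, so never inflated
FROM
    Customers C
LEFT JOIN
    Addresses A ON C.customer_id = A.customer_id;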
8. Subqueries, CTEs & Query Architecture
As SQL queries become more complex, simply chaining `JOIN` and `WHERE` clauses can become unwieldy. Subqueries and Common Table Expressions (CTEs) offer powerful ways to break down complex problems into smaller, more manageable parts, improving both readability and maintainability.
Subqueries (Nested Queries)
A subquery (or inner query) is a query nested inside another SQL query (the outer query). They can be used in various parts of a SQL statement.
- Scalar Subquery: Returns one value
  A subquery that returns a single row and a single column. It can be used anywhere a single value is expected (e.g., in a `SELECT` list or a `WHERE` clause comparison).
  -- Find employees whose salary is greater than the average salary
  SELECT employee_name, salary
  FROM Employees
  WHERE salary > (SELECT AVG(salary) FROM Employees); -- Scalar subquery returning one value
- Column Subquery: Returns a list (`IN` clause)
  A subquery that returns a single column but multiple rows. It's commonly used with operators like `IN`, `NOT IN`, `ANY`, `ALL`.
  -- Find employees who work in departments located in 'New York'
  SELECT E.employee_name, D.department_name
  FROM Employees E
  JOIN Departments D ON E.department_id = D.department_id
  WHERE D.department_id IN (SELECT department_id
                            FROM Departments
                            WHERE location = 'New York'); -- Column subquery returning a list
- Row/Table Subquery: Returns a virtual table (`FROM` clause)
  A subquery that returns multiple columns and multiple rows, essentially acting as a temporary table. These are often called derived tables when used in the `FROM` clause.
  -- Find employees whose salary is above their department's average
  SELECT E.employee_name, E.salary, DA.avg_dept_salary
  FROM Employees E
  JOIN (SELECT department_id, AVG(salary) AS avg_dept_salary
        FROM Employees
        GROUP BY department_id) AS DA -- Table subquery (derived table)
    ON E.department_id = DA.department_id
  WHERE E.salary > DA.avg_dept_salary;
- Correlated Subquery: Dependent on outer query (Performance hit)
  A subquery that depends on the outer query for its values. It executes once for each row processed by the outer query, which can lead to poor performance on large datasets.
  -- Find employees who earn more than the average salary in their *own* department
  SELECT E.employee_name, E.salary, E.department_id
  FROM Employees E
  WHERE E.salary > (SELECT AVG(salary)
                    FROM Employees
                    WHERE department_id = E.department_id);
  -- The inner subquery depends on E.department_id from the outer query.
  Correlated subqueries are often less performant than joins or CTEs because they are re-evaluated for every row of the outer query. While sometimes necessary, consider alternatives like joins with derived tables or window functions (a sketch of the window-function rewrite follows).
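As a sketch of that alternative (same `Employees` schema assumed), the correlated query above can be rewritten with a window function so each department's average is computed in a single pass rather than once per row:
-- Same result, avoiding per-row re-execution of the subquery
SELECT employee_name, salary, department_id
FROM (SELECT employee_name, salary, department_id,
             AVG(salary) OVER (PARTITION BY department_id) AS avg_dept_salary
      FROM Employees) T -- derived table needed: window results can't be filtered directly in WHERE
WHERE salary > avg_dept_salary;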
Common Table Expressions (CTEs)
CTEs provide a way to define a temporary, named result set that you can reference within a single SQL statement (SELECT, INSERT, UPDATE, DELETE). They significantly improve query readability and can sometimes optimize execution.
- Syntax: `WITH cte_name AS (...)`
  WITH DepartmentAvgSalary AS (
      SELECT department_id, AVG(salary) AS avg_dept_salary
      FROM Employees
      GROUP BY department_id
  )
  SELECT E.employee_name, E.salary, DAS.avg_dept_salary
  FROM Employees E
  JOIN DepartmentAvgSalary DAS ON E.department_id = DAS.department_id
  WHERE E.salary > DAS.avg_dept_salary;
CTEs make complex queries much easier to read and understand by breaking them into logical, named steps. Instead of deeply nested subqueries, you define each step clearly at the beginning of the query.
- Reusability within the same query:
A single CTE can be referenced multiple times within the same subsequent query. You can also chain CTEs, where one CTE references a previously defined CTE.
  WITH HighEarners AS (
      SELECT employee_id, employee_name, salary, department_id
      FROM Employees
      WHERE salary > 90000
  ),
  EngineeringHighEarners AS (
      SELECT HE.employee_id, HE.employee_name, HE.salary, D.department_name
      FROM HighEarners HE
      JOIN Departments D ON HE.department_id = D.department_id
      WHERE D.department_name = 'Engineering'
  )
  SELECT * FROM EngineeringHighEarners;
  -- Note: HighEarners carries department_id through so that the second CTE can join to Departments.
Best Practices
Decision Tree for Choosing CTEs vs. Subqueries vs. Temp Tables
When should you use which technique?
- Subqueries: Best for simple, one-off filtering conditions or scalar values. Can get difficult to read when deeply nested.
- CTEs: Excellent for modularizing complex queries, improving readability, and enabling recursive queries. They are typically optimized similarly to derived tables.
- Temporary Tables: Useful for very large intermediate results that need to be indexed, or for data that needs to be used across multiple, separate queries within the same session. They incur more overhead (creation, storage, cleanup) than CTEs.
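To make the temp-table option concrete, here is a minimal sketch (PostgreSQL-style syntax; table and index names are hypothetical). A temporary table persists for the session, can be indexed, and can be reused by several later statements:
-- Materialize a large intermediate result once, index it, reuse it
CREATE TEMPORARY TABLE RecentOrders AS
SELECT order_id, customer_id, total_amount
FROM Orders
WHERE order_date >= DATE '2023-01-01';

CREATE INDEX idx_recent_orders_customer ON RecentOrders (customer_id);

-- Query 1: per-customer totals
SELECT customer_id, SUM(total_amount) FROM RecentOrders GROUP BY customer_id;
-- Query 2 (same session): join against another table
SELECT C.customer_name, R.total_amount
FROM RecentOrders R JOIN Customers C ON C.customer_id = R.customer_id;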
Practice & Application
🎯 Practice: Scalar Subquery - Employees Above Average Salary
Retrieve the names and salaries of all employees who earn more than the overall average salary of *all* employees in the company.
Click for Solution
This problem uses a scalar subquery. The inner query calculates a single value (the overall average salary), which the outer query then uses for comparison in its WHERE clause.
SELECT
E.first_name,
E.last_name,
E.salary
FROM
Employees E
WHERE
E.salary > (SELECT AVG(salary) FROM Employees); -- Scalar subquery returns single average salary
Explanation:
- The subquery `(SELECT AVG(salary) FROM Employees)` executes first, computing the average salary across the entire `Employees` table.
- This single average value is then used by the outer `SELECT` statement's `WHERE` clause to filter employees.
🎯 Practice: Column Subquery - Departments with Recent Hires
List the names of all departments that have hired at least one employee in the year 2023.
Click for Solution
This problem can be solved efficiently using a column subquery within an IN clause. The subquery will return a list of department IDs that satisfy the condition, which the outer query then uses to filter departments.
-- Assume Departments (department_id, department_name)
-- Assume Employees (employee_id, first_name, last_name, hire_date, department_id)
SELECT
D.department_name
FROM
Departments D
WHERE
D.department_id IN (SELECT DISTINCT department_id
FROM Employees
WHERE hire_date BETWEEN '2023-01-01' AND '2023-12-31'); -- Column subquery returns list of IDs
Explanation:
- The inner subquery identifies all unique `department_id`s from the `Employees` table where the `hire_date` falls within 2023.
- The outer query then selects `department_name` from the `Departments` table, including only those departments whose `department_id` is found in the list generated by the subquery.
🎯 Practice: Employees Earning More Than Department Average (CTE vs. Derived Table)
Find the names and salaries of employees who earn more than the average salary of their own department. Show their salary and their department's average salary.
Click for Solution
This is a classic problem that nicely illustrates the advantage of CTEs for readability compared to a derived table (a subquery in the FROM clause). Both approaches are shown.
Solution 1: Using a Derived Table (Subquery in FROM)
SELECT
E.first_name,
E.last_name,
E.salary,
DA.avg_dept_salary
FROM
Employees E
JOIN
(SELECT department_id, AVG(salary) AS avg_dept_salary
FROM Employees
GROUP BY department_id) AS DA -- Derived Table (Table Subquery)
ON E.department_id = DA.department_id
WHERE
E.salary > DA.avg_dept_salary;
Solution 2: Using a Common Table Expression (CTE)
WITH DepartmentAverage AS (
SELECT
department_id,
AVG(salary) AS avg_dept_salary
FROM
Employees
GROUP BY
department_id
)
SELECT
E.first_name,
E.last_name,
E.salary,
DA.avg_dept_salary
FROM
Employees E
JOIN
DepartmentAverage DA ON E.department_id = DA.department_id
WHERE
E.salary > DA.avg_dept_salary;
Explanation:
- Both solutions first calculate the average salary per department.
- The Derived Table approach nests this calculation directly into the `FROM` clause, requiring an alias (`DA`). It can become hard to read with many nested subqueries.
- The CTE approach uses `WITH DepartmentAverage AS (...)` to define this intermediate result set with a clear name upfront. The main `SELECT` query then references this CTE as if it were a regular table. This greatly enhances readability and modularity, especially for more complex multi-step logic.
🎯 Practice: Chained CTEs - Top 2 Highest Paid Employees Per Department
Find the top 2 highest-paid employees in each department. Include their name, salary, department name, and their rank within their department.
Click for Solution
This problem is best solved using chained CTEs combined with window functions (which will be covered in more detail later, but this demonstrates a common pattern). We'll use ROW_NUMBER() to rank employees within each department.
-- Assume Employees (employee_id, first_name, last_name, salary, department_id)
-- Assume Departments (department_id, department_name)
WITH EmployeeRanked AS (
SELECT
E.employee_id,
E.first_name,
E.last_name,
E.salary,
E.department_id,
ROW_NUMBER() OVER (PARTITION BY E.department_id ORDER BY E.salary DESC) AS rank_in_dept
FROM
Employees E
),
Top2PerDepartment AS (
SELECT
ER.employee_id,
ER.first_name,
ER.last_name,
ER.salary,
ER.department_id,
ER.rank_in_dept
FROM
EmployeeRanked ER
WHERE
ER.rank_in_dept <= 2 -- Filter for the top 2 ranked employees
)
SELECT
T2.first_name,
T2.last_name,
T2.salary,
D.department_name,
T2.rank_in_dept
FROM
Top2PerDepartment T2
JOIN
Departments D ON T2.department_id = D.department_id
ORDER BY
D.department_name, T2.rank_in_dept;
Explanation:
- `EmployeeRanked` CTE: This first CTE calculates a rank for each employee within their respective department. `PARTITION BY E.department_id` divides the data into separate groups for each department, and `ORDER BY E.salary DESC` ranks employees within each group by salary in descending order.
- `Top2PerDepartment` CTE: This second CTE builds upon the first. It selects all columns from `EmployeeRanked` and applies a filter to keep only those employees whose `rank_in_dept` is 1 or 2.
- Final `SELECT`: The main query then joins `Top2PerDepartment` with the `Departments` table to fetch the actual department names and presents the final, desired output, ordered for readability. This multi-step approach is far clearer than nesting many subqueries.
9. Database Constraints & Indexing Strategy
Ensuring data quality and optimizing database performance are two critical aspects of working with SQL. Constraints guarantee the integrity of your data, while indexes dramatically speed up data retrieval.
Data Integrity Constraints
Constraints are rules enforced on data columns in a table. They are used to limit the type of data that can go into a table, ensuring the accuracy and reliability of the data.
`NOT NULL` (Mandatory fields)
Ensures that a column cannot have a NULL value. The field must contain a value.
CREATE TABLE Users (
user_id INT PRIMARY KEY,
username VARCHAR(50) NOT NULL -- username cannot be NULL
);
`UNIQUE` (Preventing duplicates)
Ensures that all values in a column (or a set of columns) are unique. NULLs are allowed, but behavior varies by RDBMS: PostgreSQL and MySQL permit multiple NULLs in a UNIQUE column, while SQL Server allows only one.
CREATE TABLE Products (
product_id INT PRIMARY KEY,
product_name VARCHAR(100) UNIQUE -- product_name must be unique
);
`PRIMARY KEY` (Unique + Not Null)
A column or set of columns that uniquely identifies each row in a table. It implicitly enforces both UNIQUE and NOT NULL constraints.
CREATE TABLE Orders (
order_id INT PRIMARY KEY, -- order_id is unique and cannot be NULL
order_date DATE
);
`FOREIGN KEY` (Relationship enforcement)
A column or set of columns in one table that refers to the PRIMARY KEY (or a UNIQUE key) in another table. It establishes a link between tables and enforces referential integrity.
CREATE TABLE OrderItems (
item_id INT PRIMARY KEY,
order_id INT,
product_id INT,
FOREIGN KEY (order_id) REFERENCES Orders(order_id)
-- This ensures order_id must exist in the Orders table
);
`CHECK` (Custom validation logic)
Ensures that all values in a column satisfy a specific condition. You define the logical expression.
CREATE TABLE Employees (
employee_id INT PRIMARY KEY,
salary DECIMAL(10, 2) CHECK (salary > 0) -- Salary must be positive
);
`DEFAULT` (Fallback values)
Provides a default value for a column when no value is explicitly specified during an INSERT operation.
CREATE TABLE Tasks (
task_id INT PRIMARY KEY,
task_description TEXT,
status VARCHAR(20) DEFAULT 'Pending' -- Default status is 'Pending'
);
Indexing Fundamentals (Under the Hood)
An index is a special lookup table that the database search engine can use to speed up data retrieval. Without indexes, the database engine would have to scan every row in a table to find matching data, which is slow for large tables.
What is an Index? (The Book Index Analogy)
Think of an index in a database like the index at the back of a textbook:
- Without an index: If you want to find all mentions of "SQL Joins" in a textbook without an index, you'd have to read every single page from cover to cover. This is a "full table scan" in database terms.
- With an index: You go to the index, find "SQL Joins", see it's on pages 57, 123, and 201. You immediately jump to those pages. This is how a database uses an index to quickly locate relevant rows.
-- Creating a simple index on a column
CREATE INDEX idx_employee_lastname
ON Employees (last_name);
-- Creating a composite index on multiple columns
CREATE INDEX idx_order_customer_date
ON Orders (customer_id, order_date DESC);
Clustered (Physical order) vs. Non-Clustered (Logical pointer)
These are the two main types of indexes, differing in how they store and manage data.
🌳 Clustered Index
- Physical Storage: Determines the physical order of data rows in the table. The table data itself is stored in the order of the clustered index.
- One Per Table: A table can have only one clustered index because the data rows can only be physically sorted in one order.
- Primary Key: Often, the `PRIMARY KEY` automatically creates a clustered index.
- Best for: Range scans, `ORDER BY` clauses, frequently accessed sequential data.
📑 Non-Clustered Index
- Logical Pointer: Does not change the physical order of the data. It's a separate structure containing the indexed column(s) and pointers (row locators) to the actual data rows in the table.
- Multiple Per Table: A table can have many non-clustered indexes.
- Best for: Specific lookups (`WHERE column = 'value'`), joins (on the indexed columns).
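As a hedged sketch of the distinction, here is SQL Server-style syntax, where clustered vs. non-clustered is spelled out explicitly (table and index names are illustrative):
-- Physically orders the table's rows by order_id (one per table,
-- assuming the table doesn't already have a clustered primary key)
CREATE CLUSTERED INDEX idx_orders_id
ON Orders (order_id);

-- A separate lookup structure with pointers back to the rows (many allowed)
CREATE NONCLUSTERED INDEX idx_orders_customer
ON Orders (customer_id);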
Trade-offs: Read speed increase vs. Write/Update speed decrease
Indexes are not a free lunch. While they significantly boost read performance (SELECT statements), they come with costs:
- ✅ Pros (Reads):
- Faster data retrieval for queries using `WHERE` clauses, `JOIN` conditions, `ORDER BY`, `GROUP BY` on indexed columns.
- Faster data retrieval for queries using
- ❌ Cons (Writes/Updates):
- Increased Storage: Indexes consume disk space (as they are separate data structures).
- Slower DML Operations: Every time you `INSERT`, `UPDATE`, or `DELETE` a row in the main table, the corresponding indexes also need to be updated. This adds overhead and slows down write operations.
- Maintenance Overhead: The database engine needs to maintain indexes, which can impact performance, especially for tables with high write activity.
B-Tree Structure Diagram (Simplified)
Most SQL databases use B-tree (Balanced Tree) structures for their indexes. This structure allows for efficient searching, insertion, and deletion.
[Diagram (simplified): a B-tree's root and intermediate nodes hold key ranges that guide the search down to leaf nodes, which hold the data pointers.]
Practice & Application
🎯 Practice: Designing a Table with Comprehensive Constraints
Design and create a SQL table named CourseEnrollments with the following requirements:
- An auto-incrementing `enrollment_id` that serves as the primary key.
- A `student_id` (integer) and a `course_id` (integer) that together form a unique composite key (a student can enroll in a course only once).
- `student_id` must reference the `student_id` in an existing `Students` table.
- `course_id` must reference the `course_id` in an existing `Courses` table.
- An `enrollment_date` that defaults to the current date.
- A `grade` column (VARCHAR(2)) that can be NULL initially but, if provided, must be one of 'A+', 'A', 'B+', 'B', 'C+', 'C', 'D', 'F'.
Click for Solution
This problem demonstrates the creation of a table incorporating multiple data integrity constraints: `PRIMARY KEY`, `FOREIGN KEY`, `UNIQUE` (composite), `DEFAULT`, and `CHECK`.
-- First, assume Students and Courses tables already exist with appropriate PKs:
-- CREATE TABLE Students (student_id INT PRIMARY KEY, student_name VARCHAR(100));
-- CREATE TABLE Courses (course_id INT PRIMARY KEY, course_name VARCHAR(100));
CREATE TABLE CourseEnrollments (
enrollment_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- PostgreSQL syntax for auto-increment PK
-- (For MySQL, use INT AUTO_INCREMENT PRIMARY KEY)
-- (For SQL Server, use INT IDENTITY(1,1) PRIMARY KEY)
student_id INT NOT NULL,
course_id INT NOT NULL,
enrollment_date DATE DEFAULT CURRENT_DATE,
grade VARCHAR(2) CHECK (grade IN ('A+', 'A', 'B+', 'B', 'C+', 'C', 'D', 'F')),
-- Composite UNIQUE key for student_id and course_id
CONSTRAINT UQ_StudentCourse UNIQUE (student_id, course_id),
-- Foreign Key referencing Students table
CONSTRAINT FK_StudentEnrollment FOREIGN KEY (student_id) REFERENCES Students(student_id),
-- Foreign Key referencing Courses table
CONSTRAINT FK_CourseEnrollment FOREIGN KEY (course_id) REFERENCES Courses(course_id)
);
Explanation:
- `enrollment_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY` defines an auto-incrementing primary key.
- `student_id INT NOT NULL` and `course_id INT NOT NULL` ensure these critical identifiers are never missing.
- `UNIQUE (student_id, course_id)` forms a composite unique key, preventing a student from being enrolled in the same course more than once.
- `DEFAULT CURRENT_DATE` automatically populates the `enrollment_date` if not provided.
- `CHECK (grade IN (...))` restricts the allowed values for the `grade` column.
- `FOREIGN KEY` constraints link to the `Students` and `Courses` tables, enforcing referential integrity.
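To see the constraints in action, here is a hedged sketch of inserts against this table (the IDs are hypothetical and assume the referenced students and courses exist):
INSERT INTO CourseEnrollments (student_id, course_id, grade)
VALUES (1, 10, 'A');   -- Succeeds: all constraints satisfied
INSERT INTO CourseEnrollments (student_id, course_id, grade)
VALUES (1, 10, 'B');   -- Fails: violates UQ_StudentCourse (duplicate enrollment)
INSERT INTO CourseEnrollments (student_id, course_id, grade)
VALUES (1, 11, 'E');   -- Fails: violates the CHECK constraint ('E' is not an allowed grade)
INSERT INTO CourseEnrollments (student_id, course_id)
VALUES (2, 12);        -- Succeeds: grade stays NULL, enrollment_date defaults to today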
🎯 Practice: Modifying Existing Constraints and Column Definitions
You have an existing Users table defined as:
CREATE TABLE Users (
user_id INT PRIMARY KEY,
username VARCHAR(50) UNIQUE,
email VARCHAR(100),
registration_date DATE
);
- Make the `email` column `NOT NULL`, assuming all existing rows already have a non-NULL email.
- Add a new `last_login_timestamp` column (TIMESTAMP) which should automatically set its value to the current timestamp if not explicitly provided during an INSERT.
Click for Solution
This problem uses `ALTER TABLE` to modify column constraints and add a new column with a default value, demonstrating how to evolve a schema over time.
-- 1. Make the 'email' column NOT NULL
ALTER TABLE Users
ALTER COLUMN email SET NOT NULL; -- PostgreSQL syntax
-- For SQL Server: ALTER TABLE Users ALTER COLUMN email VARCHAR(100) NOT NULL;
-- For MySQL: ALTER TABLE Users MODIFY COLUMN email VARCHAR(100) NOT NULL;
-- 2. Add 'last_login_timestamp' with a default value
ALTER TABLE Users
ADD COLUMN last_login_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP; -- CURRENT_TIMESTAMP works across most systems; MySQL also accepts NOW(), SQL Server GETDATE()
Explanation:
- The `ALTER COLUMN email SET NOT NULL` command (or its RDBMS-specific equivalent) enforces that future inserts or updates to the `email` column cannot be `NULL`. This operation would fail if there were any existing `NULL` values in the `email` column.
- The `ADD COLUMN last_login_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP` command adds a new column and ensures that if an `INSERT` statement doesn't provide a value for this column, the database will automatically stamp it with the time of insertion.
🎯 Practice: Strategic Indexing for Query Performance
Consider a large Transactions table with millions of rows, containing columns like transaction_id (PK), customer_id, transaction_date, amount, and status. A frequently run query looks like this:
SELECT transaction_id, transaction_date, amount
FROM Transactions
WHERE customer_id = 12345
AND transaction_date BETWEEN '2023-01-01' AND '2023-01-31'
AND status = 'completed'
ORDER BY transaction_date DESC;
Suggest a single composite index that would most effectively speed up this specific query, explaining why.
Click for Solution
This problem tests your ability to design an effective index by considering the `WHERE` clause conditions and the `ORDER BY` clause.
CREATE INDEX idx_transactions_customer_status_date
ON Transactions (customer_id, status, transaction_date DESC, amount);
Explanation:
- `customer_id` first: This is (often) the most selective column, and it's used with an equality operator (`=`). The index first narrows the search to a single customer's transactions.
- `status` second: Also an equality filter (`status = 'completed'`). Placing both equality columns before the range column lets the index seek directly to the exact slice of rows that satisfies them.
- `transaction_date DESC` third: With the equality columns fixed, the matching rows are stored in date order, so the database can apply the `BETWEEN` range filter and satisfy `ORDER BY transaction_date DESC` directly from the index, without an additional sort step.
- `amount` last: While `amount` only appears in the `SELECT` list, including it makes this a "covering index" (an index that contains all columns needed by the query), so the database can answer the query entirely from the index without going back to the main table.
Why this order? Composite indexes are most effective when equality-filtered columns lead and the range/sort column follows them; a range predicate placed earlier would prevent efficient seeks on any column after it. The `ORDER BY` clause also benefits greatly from a matching index order. You can confirm the index is actually used by checking the execution plan, as sketched below.
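A minimal sketch of that check using PostgreSQL's `EXPLAIN ANALYZE` (other systems expose plans through their own tools, e.g., SQL Server's graphical plan):
EXPLAIN ANALYZE
SELECT transaction_id, transaction_date, amount
FROM Transactions
WHERE customer_id = 12345
  AND transaction_date BETWEEN '2023-01-01' AND '2023-01-31'
  AND status = 'completed'
ORDER BY transaction_date DESC;
-- Look for an "Index Scan" or "Index Only Scan" on the new index rather than a "Seq Scan".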
🎯 Practice: Discussing Indexing Trade-offs - When NOT to Index
You are managing a `SessionLogs` table. This table receives millions of new records every hour (very high write volume). It has columns such as `session_id` (PK), `user_id`, `timestamp`, `event_type`, and `session_status` (which can only be 'Active', 'Inactive', 'Error').
Your team proposes adding an index on the `session_status` column to speed up queries like `SELECT COUNT(*) FROM SessionLogs WHERE session_status = 'Active';`
Should you implement this index? Discuss the trade-offs and explain your reasoning.
Click for Solution
This problem requires evaluating the costs and benefits of an index in a high-write, low-cardinality scenario, demonstrating a practical understanding of indexing trade-offs.
Reasoning:
- ❌ High Write Volume Impact: The table receives millions of records per hour. Every `INSERT` operation would require updating the `session_status` index. This overhead would significantly slow down write performance, which is a critical concern for this table.
- ❌ Low Cardinality Column: The `session_status` column has very low cardinality (only three distinct values: 'Active', 'Inactive', 'Error'). An index is most effective when it quickly narrows down the search space. For a low-cardinality column, an index on `session_status` would point to a large percentage of the table's rows (e.g., if 50% are 'Active', the index will still lead to scanning half the table). In such cases, the database's query optimizer might determine that a full table scan is *more efficient* than traversing the index and then fetching half the data pages, especially if the data is already physically grouped.
- ✅ Potential Read Benefit (but limited): While queries filtering by `session_status` might theoretically benefit, the actual gain might be minimal due to low cardinality. For `COUNT(*)`, the index might only help slightly if it's a covering index for the count, but it won't drastically speed up filtering for a large proportion of the table.
Conclusion:
Given the extremely high write volume and the low cardinality of the `session_status` column, adding an index on `session_status` is likely **not a good idea**. The performance penalty on `INSERT` operations would outweigh the minimal (if any) read performance gains.
Alternative considerations: For such scenarios, partitioning the table by `timestamp` might be more effective for managing large data volumes and specific date-range queries. If reads on `session_status` are absolutely critical and slow, a materialized view that pre-aggregates counts for each status might be a better approach than a direct index on a low-cardinality column in a high-write table.
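As a hedged sketch of that last alternative (PostgreSQL syntax, hypothetical names), a periodically refreshed materialized view keeps the expensive count off the hot write path:
CREATE MATERIALIZED VIEW SessionStatusCounts AS
SELECT session_status, COUNT(*) AS session_count
FROM SessionLogs
GROUP BY session_status;

-- Refresh on a schedule (e.g., every few minutes) instead of indexing the base table:
-- REFRESH MATERIALIZED VIEW SessionStatusCounts;
SELECT session_count FROM SessionStatusCounts WHERE session_status = 'Active';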
10. Set Operations
Set operations in SQL allow you to combine the results of two or more SELECT statements into a single result set. Unlike joins, which combine data horizontally based on related columns, set operations combine data vertically, stacking one result set on top of another.
Vertical Combination of Data
Let's explore the primary set operators:
-- Assume two tables with similar structures:
-- Employees (employee_id, name, city)
-- Contractors (contractor_id, name, city)
-- Both have a 'name' and 'city' column for demonstration.
- `UNION` (Removes duplicates, slower):
  Combines the result sets of two or more `SELECT` statements and returns all distinct rows. Any duplicate rows across the combined result sets are removed.
  SELECT name, city FROM Employees
  UNION
  SELECT name, city FROM Contractors;
  -- Result: All unique names and cities from both tables. If 'John, New York' appears in both, it will only be listed once.
- `UNION ALL` (Keeps duplicates, faster):
  Combines the result sets of two or more `SELECT` statements, but unlike `UNION`, it retains all duplicate rows. This makes it generally faster as the database doesn't need to perform a distinct sort operation.
  SELECT name, city FROM Employees
  UNION ALL
  SELECT name, city FROM Contractors;
  -- Result: All names and cities from both tables, including duplicates. If 'John, New York' appears in both, it will be listed twice.
- `INTERSECT` (Rows in both sets):
  Returns only the rows that are present in both result sets. Only rows that are identical in all selected columns will be returned.
  SELECT name, city FROM Employees
  INTERSECT
  SELECT name, city FROM Contractors;
  -- Result: Only names and cities that exist in BOTH Employees AND Contractors.
- `EXCEPT` / `MINUS` (Rows in A but not B):
  Returns rows from the first `SELECT` statement that are not found in the second `SELECT` statement. (`MINUS` is used in Oracle; `EXCEPT` in SQL Server, PostgreSQL, and MySQL 8.0.31+.)
  SELECT name, city FROM Employees
  EXCEPT -- Or MINUS in Oracle
  SELECT name, city FROM Contractors;
  -- Result: Names and cities that are in Employees but NOT in Contractors.
Requirements
For set operations to work correctly, the participating SELECT statements must adhere to two crucial rules:
- ✅ Column Count Match:
  The number of columns in all `SELECT` statements being combined must be identical.
  -- This would fail because the second SELECT has fewer columns
  -- SELECT name, city FROM Employees
  -- UNION
  -- SELECT name FROM Contractors; -- ERROR: column mismatch
- ✅ Data Type Compatibility:
  The data types of corresponding columns in each `SELECT` statement must be compatible (e.g., you can't `UNION` an `INT` with a `VARCHAR` in the same column position). Implicit conversion often occurs if types are related (e.g., `INT` to `DECIMAL`), but it's best to be explicit or ensure types are similar.
  -- This might lead to unexpected results or errors if the type mismatch is severe
  -- SELECT employee_id, name FROM Employees
  -- UNION
  -- SELECT date_of_birth, full_name FROM Contacts; -- ERROR or weird conversion
Edge Case: Sorting results in Set Operations
When you use UNION, UNION ALL, INTERSECT, or EXCEPT, the ORDER BY clause can only be specified once, at the very end of the entire combined query. It applies to the final, consolidated result set.
-- Incorrect: ORDER BY inside each SELECT
-- (SELECT name FROM Employees ORDER BY name) UNION (SELECT name FROM Contractors ORDER BY name); -- Syntax Error
-- Correct: ORDER BY at the end, referring to columns by their final selected name/position
SELECT name, city FROM Employees
UNION ALL
SELECT name, city FROM Contractors
ORDER BY name ASC, city DESC; -- Applies to the combined result set
Practice & Application
🎯 Practice: Combined Roster (Including Duplicates)
You have two tables: Employees (employee_id, name, department, email) and Contractors (contractor_id, name, department, email). You need to create a single, consolidated list of all individuals (both employees and contractors), showing their name, department, and email. It's important to include all entries, even if an individual (by name and email) appears in both tables.
Click for Solution
This problem requires `UNION ALL`. It combines the result sets of two (or more) queries and retains all duplicate rows. This makes it efficient as the database does not need to perform a costly distinct check.
-- Sample Data (conceptual):
-- Employees: { ('E1', 'Alice', 'HR', 'alice@example.com'), ('E2', 'Bob', 'IT', 'bob@example.com'), ('E3', 'Charlie', 'Sales', 'charlie@example.com') }
-- Contractors: { ('C1', 'Bob', 'IT', 'bob@example.com'), ('C2', 'David', 'Marketing', 'david@example.com') }
SELECT name, department, email FROM Employees
UNION ALL
SELECT name, department, email FROM Contractors
ORDER BY name;
Explanation:
- The first `SELECT` retrieves `name`, `department`, `email` from `Employees`.
- The second `SELECT` retrieves the same columns from `Contractors`.
- `UNION ALL` stacks these two result sets on top of each other. If 'Bob' with email 'bob@example.com' exists in both, he will appear twice.
- The final `ORDER BY name` sorts the entire combined result set.
Expected Output (Logical, order by name):
| name | department | email |
|---|---|---|
| Alice | HR | alice@example.com |
| Bob | IT | bob@example.com |
| Bob | IT | bob@example.com |
| Charlie | Sales | charlie@example.com |
| David | Marketing | david@example.com |
🎯 Practice: Unique Contact List (Removing Duplicates)
Using the same Employees and Contractors tables, you now need a unique list of all contact emails (excluding names or departments). If an email address is associated with both an employee and a contractor, it should only appear once.
Click for Solution
This problem calls for `UNION`. It combines result sets and automatically removes duplicate rows, ensuring each distinct row appears only once in the final output.
-- Sample Data (conceptual):
-- Employees: { ('E1', 'Alice', 'HR', 'alice@example.com'), ('E2', 'Bob', 'IT', 'bob@example.com'), ('E3', 'Charlie', 'Sales', 'charlie@example.com') }
-- Contractors: { ('C1', 'Bob', 'IT', 'bob@example.com'), ('C2', 'David', 'Marketing', 'david@example.com') }
SELECT email FROM Employees
UNION
SELECT email FROM Contractors
ORDER BY email;
Explanation:
- Both `SELECT` statements retrieve only the `email` column.
- `UNION` combines these lists and then performs a distinct operation, removing any `email` that appears in both original lists.
- The final `ORDER BY email` sorts the unique email addresses.
Expected Output (Logical, order by email):
| email |
|---|
| alice@example.com |
| bob@example.com |
| charlie@example.com |
| david@example.com |
🎯 Practice: Finding Dual-Role Individuals (Intersection)
Your team suspects that some individuals might be listed in both the Employees table and the Contractors table, potentially indicating a data anomaly or a special dual role. Identify the names and emails of all individuals who exist in *both* tables.
Click for Solution
To find common rows between two result sets, the `INTERSECT` operator is used. It returns only the rows that are present in both the first and second `SELECT` statements.
-- Sample Data (conceptual):
-- Employees: { ('E1', 'Alice', 'HR', 'alice@example.com'), ('E2', 'Bob', 'IT', 'bob@example.com'), ('E3', 'Charlie', 'Sales', 'charlie@example.com') }
-- Contractors: { ('C1', 'Bob', 'IT', 'bob@example.com'), ('C2', 'David', 'Marketing', 'david@example.com'), ('C3', 'Alice', 'Sales', 'alice@example.com') }
-- (Note: Alice in Contractors has a different department, but same name/email)
SELECT name, email FROM Employees
INTERSECT
SELECT name, email FROM Contractors
ORDER BY name;
Explanation:
- `INTERSECT` compares the rows produced by the two `SELECT` statements.
- Only rows where *all* selected columns (`name` and `email`) are identical in both sets are included in the final result.
- In our sample data, 'Bob' and 'Alice' are present with the same name and email in both lists.
Expected Output (Logical, order by name):
| name | email |
|---|---|
| Alice | alice@example.com |
| Bob | bob@example.com |
🎯 Practice: Exclusive Employees (Difference)
You need to identify employees who are *only* in the Employees table and have no corresponding entry (by name and email) in the Contractors table. List their name and email.
Click for Solution
The `EXCEPT` (or `MINUS` in Oracle) operator is used here. It returns rows from the first query that are not present in the second query.
-- Sample Data (conceptual):
-- Employees: { ('E1', 'Alice', 'HR', 'alice@example.com'), ('E2', 'Bob', 'IT', 'bob@example.com'), ('E3', 'Charlie', 'Sales', 'charlie@example.com') }
-- Contractors: { ('C1', 'Bob', 'IT', 'bob@example.com'), ('C2', 'David', 'Marketing', 'david@example.com'), ('C3', 'Alice', 'Sales', 'alice@example.com') }
SELECT name, email FROM Employees
EXCEPT -- Use MINUS in Oracle
SELECT name, email FROM Contractors
ORDER BY name;
Explanation:
- The first `SELECT` gets all names and emails from `Employees`.
- The second `SELECT` gets all names and emails from `Contractors`.
- `EXCEPT` then subtracts the second set from the first. Any `(name, email)` pair that exists in both will be removed from the first set's result.
- Only 'Charlie' remains because 'Alice' and 'Bob' (with their respective emails) are also found in the `Contractors` table.
Expected Output (Logical, order by name):
| name | email |
|---|---|
| Charlie | charlie@example.com |
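For systems that lack `EXCEPT` (e.g., MySQL before 8.0.31), here is a hedged sketch of the same idea using an anti-join. Note it is not an exact equivalent: `EXCEPT` also de-duplicates and treats `NULL`s as equal, while a join comparison on `NULL` never matches.
SELECT DISTINCT E.name, E.email
FROM Employees E
LEFT JOIN Contractors C
    ON E.name = C.name AND E.email = C.email
WHERE C.email IS NULL; -- keep only employee rows with no match in Contractors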
11. Views and Stored Procedures
Beyond raw tables and ad-hoc queries, databases offer mechanisms to encapsulate logic and improve performance, security, and reusability. Views and Stored Procedures are two such powerful features.
Views (Virtual Tables)
A view is a virtual table based on the result-set of a SQL query. It contains rows and columns, just like a real table, but it does not store data itself. Instead, it retrieves data from the underlying tables whenever it is queried.
CREATE VIEW ActiveEmployees AS
SELECT
employee_id,
first_name,
last_name,
email,
department_id
FROM
Employees
WHERE
status = 'Active';
-- You can now query the view just like a table:
SELECT * FROM ActiveEmployees WHERE department_id = 1;
- Hiding complexity / Business logic encapsulation:
  Views allow you to pre-package complex queries (e.g., involving multiple joins, aggregations, or complex `CASE` statements) and present them as a simple table. This simplifies interactions for users or applications who don't need to understand the underlying complexity.
  -- Instead of users writing this complex query every time:
  SELECT C.customer_name, SUM(O.total_amount) AS total_spent, COUNT(O.order_id) AS num_orders
  FROM Customers C
  LEFT JOIN Orders O ON C.customer_id = O.customer_id
  GROUP BY C.customer_name
  HAVING COUNT(O.order_id) > 0;
  -- You can define a view:
  CREATE VIEW CustomerSpendingSummary AS
  SELECT C.customer_name, SUM(O.total_amount) AS total_spent, COUNT(O.order_id) AS num_orders
  FROM Customers C
  LEFT JOIN Orders O ON C.customer_id = O.customer_id
  GROUP BY C.customer_name
  HAVING COUNT(O.order_id) > 0;
  -- And then users just query the view:
  SELECT * FROM CustomerSpendingSummary WHERE total_spent > 1000;
Views can act as a security mechanism. Instead of granting users direct access to sensitive base tables, you can grant them access only to a view that exposes a subset of columns or rows. For example, a view might exclude salary or personal identifying information.
-- Granting a user access to the Employees table exposes ALL columns. -- Instead, create a view that hides sensitive data: CREATE VIEW PublicEmployeeInfo AS SELECT employee_id, first_name, last_name, department_id FROM Employees; -- Then grant access ONLY to the view: -- GRANT SELECT ON PublicEmployeeInfo TO 'reporting_user'; - Materialized Views (Cached data for performance - Conceptual):
While standard views are virtual and re-execute their query every time they are accessed, materialized views (supported in some RDBMS like PostgreSQL, Oracle, SQL Server) are different. They physically store the result of the query at the time of their creation or last refresh.
- ✅ Pros: Much faster read performance, especially for complex aggregations, as the data is pre-calculated.
- ❌ Cons: The data in a materialized view can become stale. It needs to be explicitly refreshed (manually or on a schedule) to reflect changes in the underlying base tables.
  -- PostgreSQL syntax example:
  CREATE MATERIALIZED VIEW DailySalesSummary AS
  SELECT order_date, SUM(total_amount) AS daily_revenue, COUNT(order_id) AS total_orders
  FROM Orders
  GROUP BY order_date;
  -- To update its data:
  -- REFRESH MATERIALIZED VIEW DailySalesSummary;
Stored Procedures
A stored procedure is a prepared SQL code that you can save, so the code can be reused over and over again. It's a collection of SQL statements and control-of-flow statements (like `IF`, `LOOP`).
-- PostgreSQL syntax example (syntax varies significantly by RDBMS)
CREATE PROCEDURE GetEmployeeCountByDepartment(
IN dept_id INT,
OUT employee_count INT
)
LANGUAGE plpgsql
AS $$
BEGIN
SELECT COUNT(*) INTO employee_count
FROM Employees
WHERE department_id = dept_id;
END;
$$;
-- To execute (PostgreSQL):
-- CALL GetEmployeeCountByDepartment(1, NULL); -- Result will be in a message or output parameter
-- For SQL Server: EXEC GetEmployeeCountByDepartment @dept_id = 1, @employee_count = @count OUTPUT;
- Pre-compiled SQL code:
When a stored procedure is created, the database compiles it once and stores the execution plan. Subsequent executions use this pre-compiled plan, which can result in faster performance compared to executing raw SQL statements repeatedly.
- Accepting parameters:
Stored procedures can accept input parameters, allowing for dynamic and flexible execution. They can also return output parameters or result sets.
  -- Example with input and output parameters
  CREATE PROCEDURE UpdateEmployeeSalary (
      IN employee_id_param INT,
      IN percentage_raise DECIMAL(5, 2),
      OUT new_salary_out DECIMAL(10, 2)
  )
  LANGUAGE plpgsql
  AS $$
  BEGIN
      UPDATE Employees
      SET salary = salary * (1 + percentage_raise)
      WHERE employee_id = employee_id_param;

      SELECT salary INTO new_salary_out
      FROM Employees
      WHERE employee_id = employee_id_param;
  END;
  $$;
  -- CALL UpdateEmployeeSalary(101, 0.05, NULL);
Instead of sending multiple complex SQL statements over the network, an application can simply send a single command to execute a stored procedure. The procedure then executes on the database server, reducing network latency and traffic.
- Security: Similar to views, stored procedures can provide an extra layer of security. Users can be granted permission to execute a procedure without having direct permissions on the underlying tables, protecting sensitive data and enforcing business rules.
- Encapsulation of Business Logic: Complex business rules and workflows can be implemented directly within the database, ensuring consistency across all applications that interact with the data.
Practice & Application
🎯 Practice: Creating a Restricted Employee View
Your company's HR department needs a simplified view of employee data. Create a view named HR_Employee_Overview that includes employee_id, first_name, last_name, department_id, and hire_date for all active employees. This view should explicitly exclude sensitive information like salaries or full email addresses.
Click for Solution
This problem demonstrates how to use a view to simplify complex queries and enhance security by restricting access to a subset of columns and rows from a base table.
-- Assume an existing 'Employees' table with:
-- employee_id, first_name, last_name, salary, department_id, hire_date, email, is_active
CREATE VIEW HR_Employee_Overview AS
SELECT
employee_id,
first_name,
last_name,
department_id,
hire_date
FROM
Employees
WHERE
is_active = TRUE; -- Assuming a boolean 'is_active' column
-- To use the view:
-- SELECT * FROM HR_Employee_Overview WHERE department_id = 5;
Explanation:
- The `CREATE VIEW` statement defines a virtual table based on the provided `SELECT` query.
- Only the specified columns (`employee_id`, `first_name`, `last_name`, `department_id`, `hire_date`) are made available through the view, effectively hiding `salary` and `email`.
- The `WHERE is_active = TRUE` clause pre-filters the rows, ensuring only active employees are ever seen through this view.
- HR staff can now query `HR_Employee_Overview` as if it were a regular table, without needing to write the complex `SELECT` statement or having direct access to sensitive data in the underlying `Employees` table.
🎯 Practice: Customer Spending Summary View
Create a view named Customer_Revenue_Summary that shows each customer's name, their total number of orders, and their total lifetime spending. Only include customers who have placed at least one order.
Click for Solution
This problem demonstrates creating a view that encapsulates more complex logic, including joins and aggregation, making a derived dataset readily available for reporting.
-- Assume Customers (customer_id, customer_name, email)
-- Assume Orders (order_id, customer_id, order_date, total_amount, status)
CREATE VIEW Customer_Revenue_Summary AS
SELECT
C.customer_name,
COUNT(O.order_id) AS total_orders,
SUM(O.total_amount) AS total_spending
FROM
Customers C
INNER JOIN -- Use INNER JOIN to only include customers who have placed orders
Orders O ON C.customer_id = O.customer_id
GROUP BY
C.customer_name
ORDER BY
total_spending DESC; -- Note: some systems (e.g., SQL Server) disallow ORDER BY inside a view definition; ordering is usually applied when querying the view. Kept here for presentation.
-- To use the view:
-- SELECT customer_name, total_spending FROM Customer_Revenue_Summary WHERE total_orders > 5;
Explanation:
- The view performs an `INNER JOIN` between `Customers` and `Orders` to link customers to their transactions.
- `GROUP BY C.customer_name` aggregates the results for each unique customer name (grouping by `customer_id` as well would be safer if two customers can share a name).
- `COUNT(O.order_id)` and `SUM(O.total_amount)` calculate the total orders and spending per customer, respectively.
- This view simplifies access to customer revenue metrics, allowing analysts to query a single, pre-calculated source without needing to re-write the joins and aggregations.
🎯 Practice: Stored Procedure to Update Employee Department
Create a stored procedure named TransferEmployee that takes an employee_id and a new_department_id as input. The procedure should update the specified employee's department. It should also include a simple check to ensure the new_department_id actually exists in the Departments table.
Click for Solution
This problem demonstrates creating a stored procedure with input parameters, DML operations (`UPDATE`), and basic conditional logic for data validation.
-- Assume Employees (employee_id, first_name, department_id)
-- Assume Departments (department_id, department_name)
-- PostgreSQL syntax (syntax varies significantly by RDBMS, e.g., SQL Server uses CREATE PROCEDURE)
CREATE OR REPLACE PROCEDURE TransferEmployee(
IN p_employee_id INT,
IN p_new_department_id INT
)
LANGUAGE plpgsql
AS $$
DECLARE
dept_exists BOOLEAN;
BEGIN
-- Check if the new department ID exists
SELECT EXISTS (SELECT 1 FROM Departments WHERE department_id = p_new_department_id)
INTO dept_exists;
IF dept_exists THEN
UPDATE Employees
SET department_id = p_new_department_id
WHERE employee_id = p_employee_id;
RAISE NOTICE 'Employee % transferred to Department % successfully.', p_employee_id, p_new_department_id;
ELSE
RAISE EXCEPTION 'Department ID % does not exist.', p_new_department_id;
END IF;
END;
$$;
-- To execute (PostgreSQL):
-- CALL TransferEmployee(101, 2); -- Should succeed
-- CALL TransferEmployee(102, 999); -- Should raise an exception/error
Explanation:
- The procedure `TransferEmployee` accepts two input parameters: `p_employee_id` and `p_new_department_id`.
- It first declares a variable `dept_exists` to store the result of a check.
- An `EXISTS` subquery efficiently verifies if the `p_new_department_id` is valid by looking it up in the `Departments` table.
- An `IF/ELSE` block then decides whether to proceed with the `UPDATE` (if the department exists) or raise an exception (if it doesn't). This basic error handling improves data integrity.
- `RAISE NOTICE` provides feedback on successful execution, while `RAISE EXCEPTION` stops the procedure and signals an error for invalid input.
🎯 Practice: Stored Procedure to Retrieve Filtered Products
Create a stored procedure named GetProductsByPriceRange that takes two input parameters: min_price (DECIMAL) and max_price (DECIMAL). The procedure should return all `product_id`, `product_name`, and `price` for products whose price falls within the specified range (inclusive).
Click for Solution
This problem demonstrates parameterized, reusable data retrieval. One caveat: in PostgreSQL, a procedure cannot return a result set from a bare SELECT, so the idiomatic solution there is a set-returning function; in systems like SQL Server, the same logic works as a stored procedure containing a plain SELECT.
-- Assume Products (product_id, product_name, price, category_id)
-- PostgreSQL syntax (syntax varies significantly by RDBMS).
-- Note: a PostgreSQL procedure cannot return rows from a bare SELECT,
-- so a set-returning function is used instead.
-- (product_name VARCHAR(100) is an assumed column type.)
CREATE OR REPLACE FUNCTION GetProductsByPriceRange(
    p_min_price DECIMAL(10, 2),
    p_max_price DECIMAL(10, 2)
)
RETURNS TABLE (product_id INT, product_name VARCHAR(100), price DECIMAL(10, 2))
LANGUAGE plpgsql
AS $$
BEGIN
    RETURN QUERY
    SELECT
        P.product_id,
        P.product_name,
        P.price
    FROM
        Products P
    WHERE
        P.price BETWEEN p_min_price AND p_max_price
    ORDER BY
        P.price ASC;
END;
$$;
-- To execute (PostgreSQL):
-- SELECT * FROM GetProductsByPriceRange(10.00, 50.00);
Explanation:
- The function `GetProductsByPriceRange` accepts `p_min_price` and `p_max_price` as input.
- Inside the `BEGIN...END` block, `RETURN QUERY` executes a standard `SELECT` and hands its rows back to the caller.
- The `WHERE` clause uses the input parameters with the `BETWEEN` operator to filter products within the specified price range.
- The `ORDER BY price ASC` ensures the results are returned in a consistent, sorted manner.
- This centralizes the query logic and allows applications to request filtered data with simple calls.
12. Window Functions (The "Cheat Code")
Window functions are a powerful and often "advanced" SQL feature that allow you to perform calculations across a set of table rows that are related to the current row. What makes them unique is that, unlike aggregate functions (like SUM() with GROUP BY), window functions do not collapse the rows into a single output row; they retain the individual row details while still providing aggregated or ranked insights.
Conceptual Understanding
- Aggregation without collapsing rows:
  With `GROUP BY` and aggregate functions, you get one row per group. With window functions, you get one output row for each input row, but each output row includes a calculated value based on a "window" of related rows.
  -- Example: Get each employee's salary AND the average salary of their department
  SELECT
      employee_id,
      first_name,
      salary,
      department_id,
      AVG(salary) OVER (PARTITION BY department_id) AS avg_dept_salary -- Window function!
  FROM Employees;
  -- This returns every employee's row, with an additional column showing their department's average.
- The `OVER()` clause anatomy:
  The magic of window functions lies within the `OVER()` clause. It defines the "window" or set of rows on which the function operates. It has two main components:
  - `PARTITION BY`: Divides the result set into partitions (groups) to which the window function is independently applied. Think of it as the "grouping" mechanism for the window.
  - `ORDER BY`: Orders the rows within each partition. This is crucial for functions that depend on order, like ranking or lead/lag.
Ranking Functions
These functions assign a rank to each row within its partition based on the specified order.
- `ROW_NUMBER()` (Unique increment):
  Assigns a unique, sequential integer to each row within its partition, starting from 1. If there are ties in the ordering, `ROW_NUMBER()` assigns arbitrary but unique ranks.
  SELECT
      employee_id,
      first_name,
      salary,
      department_id,
      ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank_in_dept
  FROM Employees;
- `RANK()` (Skips numbers on ties):
  Assigns the same rank to rows that have identical values in the `ORDER BY` clause. If there are ties, the next rank value will "skip" numbers (e.g., if two rows are rank 1, the next unique rank will be 3).
- `DENSE_RANK()` (No skipping on ties):
  Similar to `RANK()`, it assigns the same rank to tied rows. However, `DENSE_RANK()` does not skip rank numbers after a tie (e.g., if two rows are rank 1, the next unique rank will be 2).
Table Comparison of Ranking Functions
Let's illustrate ROW_NUMBER(), RANK(), and DENSE_RANK() within a single partition (e.g., for one department, ordered by salary DESC).
| Employee | Salary | ROW_NUMBER() | RANK() | DENSE_RANK() |
|---|---|---|---|---|
| Alice | 100,000 | 1 | 1 | 1 |
| Bob | 90,000 | 2 | 2 | 2 |
| Charlie | 80,000 | 3 | 3 | 3 |
| David | 80,000 | 4 | 3 | 3 |
| Eve | 70,000 | 5 | 5 | 4 |
| Frank | 70,000 | 6 | 5 | 4 |
| Grace | 70,000 | 7 | 5 | 4 |
| Hannah | 60,000 | 8 | 8 | 5 |
- `ROW_NUMBER()` is sequential and unique, even for ties.
- `RANK()` skips numbers (e.g., after two 80k salaries get rank 3, Eve gets rank 5, skipping 4).
- `DENSE_RANK()` does not skip numbers (e.g., after two 80k salaries get rank 3, Eve gets rank 4).
Positional Functions
These functions allow you to access data from preceding or succeeding rows within the current row's window.
- `LAG()` (Previous row access):
  Retrieves a value from a row that comes before the current row within the partition, based on the `ORDER BY` clause. Useful for calculating differences or seeing previous states.
  -- Get each employee's salary and the salary of the employee hired immediately before them in their department
  SELECT
      employee_id,
      first_name,
      hire_date,
      salary,
      department_id,
      LAG(salary, 1) OVER (PARTITION BY department_id ORDER BY hire_date) AS previous_hire_salary
  FROM Employees;
- `LEAD()` (Next row access):
  Retrieves a value from a row that comes after the current row within the partition, based on the `ORDER BY` clause. Useful for forecasting or comparing with future states.
  -- Get each employee's salary and the salary of the employee hired immediately after them in their department
  SELECT
      employee_id,
      first_name,
      hire_date,
      salary,
      department_id,
      LEAD(salary, 1, 0) OVER (PARTITION BY department_id ORDER BY hire_date) AS next_hire_salary -- 0 is the default if no next row
  FROM Employees;
Partitioning & Ordering
The `OVER()` clause components are critical for defining the scope and sequence of your window function calculations.
- `PARTITION BY` (Resetting the window):
  Defines the groups of rows over which the window function operates. When the `PARTITION BY` column's value changes, the window (and thus the function's calculation) "resets".
  Example: `AVG(salary) OVER (PARTITION BY department_id)` calculates a separate average for each unique `department_id`.
  If `PARTITION BY` is omitted, the entire result set is treated as a single partition (one big window).
- `ORDER BY` (Sorting within the window):
  Determines the logical order of rows within each partition. This is especially important for ranking functions (like `ROW_NUMBER()`) and positional functions (like `LAG()`/`LEAD()`) as their results are order-dependent.
  Example: `ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC)` ranks employees within each department from highest to lowest salary.
Practice & Application
🎯 Practice: Departmental Salary Analytics
For each employee, retrieve their first_name, last_name, salary, and department_id. In addition, display two new columns: the average salary for their respective department and the highest salary within their department. All details for each individual employee should remain in their own row.
Click for Solution
This problem uses window functions to perform aggregation (AVG() and MAX()) over a partition (department_id) without collapsing the individual employee rows. This means each employee's record is displayed, but enriched with departmental summary statistics.
-- Assume Employees table: (employee_id, first_name, last_name, salary, department_id, hire_date)
SELECT
first_name,
last_name,
salary,
department_id,
AVG(salary) OVER (PARTITION BY department_id) AS avg_dept_salary,
MAX(salary) OVER (PARTITION BY department_id) AS max_dept_salary
FROM
Employees
ORDER BY
department_id, salary DESC;
Explanation:
- `AVG(salary) OVER (PARTITION BY department_id)` calculates the average salary for each distinct `department_id`. This average is then displayed on every row belonging to that department.
- Similarly, `MAX(salary) OVER (PARTITION BY department_id)` finds the maximum salary within each `department_id` and attaches it to every row within that department.
- The crucial aspect is that these calculations happen "over" a window of rows, but the original number of rows (employees) is preserved.
🎯 Practice: Demonstrating Ranking Functions
For employees within a specific department (e.g., department_id = 101), list their first_name, salary, and their rank based on salary in descending order. Show the results from ROW_NUMBER(), RANK(), and DENSE_RANK() side-by-side to observe their differences, especially with tied salaries.
Click for Solution
This problem directly illustrates the behavior of the three main ranking window functions when dealing with ties. We'll filter for a single department to make the output clear, assuming sample data with salary ties.
-- Assume Employees table with sample data for department 101:
-- | employee_id | first_name | salary | department_id |
-- |-------------|------------|--------|---------------|
-- | 1 | Alice | 100000 | 101 |
-- | 2 | Bob | 90000 | 101 |
-- | 3 | Charlie | 80000 | 101 |
-- | 4 | David | 80000 | 101 |
-- | 5 | Eve | 70000 | 101 |
-- | 6 | Frank | 70000 | 101 |
SELECT
first_name,
salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
RANK() OVER (ORDER BY salary DESC) AS rank_val,
DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank_val
FROM
Employees
WHERE
department_id = 101
ORDER BY
salary DESC, first_name;
Explanation:
- The `OVER (ORDER BY salary DESC)` clause defines the window for all three functions. Since there's no `PARTITION BY`, the window covers all rows for `department_id = 101`.
- `ROW_NUMBER()` assigns a unique, sequential number to each row, even when salaries are tied.
- `RANK()` assigns the same rank to tied salaries and then skips numbers for the subsequent ranks. For example, if two employees rank 3rd, the next rank will be 5th.
- `DENSE_RANK()` assigns the same rank to tied salaries but does *not* skip numbers, leading to a continuous sequence of ranks. If two employees rank 3rd, the next rank will be 4th.
This output precisely matches the behavior depicted in the "Table Comparison of Ranking Functions" visual.
🎯 Practice: Analyzing Salary Evolution Over Time (LAG)
For each employee, retrieve their first_name, last_name, hire_date, and salary. Additionally, calculate the salary of the employee hired immediately before them within the same department. If an employee is the first hire in their department, show NULL for the previous employee's salary.
Click for Solution
This problem utilizes the LAG() window function to access data from a preceding row within a defined partition, enabling "previous value" comparisons.
-- Assume Employees table: (employee_id, first_name, last_name, salary, department_id, hire_date)
SELECT
first_name,
last_name,
hire_date,
salary,
department_id,
LAG(salary, 1) OVER (PARTITION BY department_id ORDER BY hire_date ASC) AS previous_hire_salary
FROM
Employees
ORDER BY
department_id, hire_date ASC;
Explanation:
- `LAG(salary, 1)` retrieves the `salary` value from the row one position before the current row within the window.
- `PARTITION BY department_id` ensures that the `LAG` function restarts for each new department, preventing an employee from being compared with someone from a different department.
- `ORDER BY hire_date ASC` orders the employees within each department by their hire date. This is crucial because `LAG()` (and `LEAD()`) depend on a defined order to determine what "previous" or "next" means.
- For the first employee hired in any department, `previous_hire_salary` will be `NULL`, as there is no preceding row in that window.
🎯 Practice: Calculating Running Total of Sales Per Customer
You have an Orders table with customer_id, order_id, order_date, and amount. For each order a customer makes, calculate their cumulative spending up to that specific order date, and include this cumulative total on each order row. Order the results by customer, then by order date.
Click for Solution
This problem uses the SUM() aggregate function as a window function with a specific frame clause, allowing for the calculation of running totals.
-- Assume Orders table: (order_id, customer_id, order_date, amount)
SELECT
customer_id,
order_id,
order_date,
amount,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date ASC) AS cumulative_spending
    -- With ORDER BY and no explicit frame, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
FROM
Orders
ORDER BY
customer_id, order_date ASC;
Explanation:
- `SUM(amount)` is the aggregate function.
- `PARTITION BY customer_id` ensures that the cumulative sum is calculated independently for each customer. The sum resets when the customer changes.
- `ORDER BY order_date ASC` specifies the order within each customer's partition. This order is essential for a running total, as it dictates what "up to this point" means.
- The default frame for aggregate window functions when `ORDER BY` is present (and no explicit frame is specified) is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`: all `amount` values from the beginning of the partition up to and including the current row are summed, and rows that tie on `order_date` share the same cumulative value. To accumulate strictly row by row, state the frame explicitly, as shown below.
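A sketch of the same running total with the frame stated explicitly; `ROWS` guarantees a strict row-by-row accumulation even if two orders share an `order_date`:

SELECT
    customer_id,
    order_id,
    order_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date ASC
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- explicit frame, one row at a time
    ) AS cumulative_spending
FROM Orders
ORDER BY customer_id, order_date ASC;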
13. Common SQL Interview Scenarios & Patterns
SQL interviews often revolve around a set of recurring problems and data manipulation challenges. Mastering these patterns will significantly boost your confidence and performance.
Duplicate Handling
Dealing with duplicate records is a very common task in data cleaning and analysis.
- Finding duplicates (`GROUP BY` + `HAVING count > 1`):
To identify rows that have duplicate values for one or more columns, group by those columns and use `HAVING` to filter for groups with a count greater than 1.

-- Find duplicate emails in the Employees table
SELECT
    email,
    COUNT(email) AS num_duplicates
FROM Employees
GROUP BY email
HAVING COUNT(email) > 1;

This identifies the *values* that are duplicated. To see the *actual duplicate rows*, you'd typically join this result back to the original table.
- Deleting duplicates (using CTEs and `ROW_NUMBER()`):
This is a trickier task, as you usually want to keep one instance of each duplicate. Window functions, especially `ROW_NUMBER()`, are ideal for this.

-- Delete duplicate rows, keeping one instance (e.g., the one with the lowest employee_id)
WITH EmployeeDuplicates AS (
    SELECT
        *, -- all columns, plus the duplicate rank below
        ROW_NUMBER() OVER (PARTITION BY email ORDER BY employee_id ASC) AS rn
    FROM Employees
)
DELETE FROM Employees
WHERE employee_id IN (
    SELECT employee_id FROM EmployeeDuplicates WHERE rn > 1
);
-- Or, if your RDBMS supports deleting through a CTE directly (e.g., SQL Server):
-- DELETE FROM EmployeeDuplicates WHERE rn > 1;
Top N Analysis
Finding the highest, lowest, or "N-th" value/record is a frequent interview question.
- Finding the N-th highest/lowest value (subquery vs. `DENSE_RANK()`):
Window functions provide the most elegant solution here. `DENSE_RANK()` is often preferred over `RANK()` because it handles ties without skipping subsequent ranks, which typically aligns with how "N-th" is interpreted. (A subquery alternative is sketched after the step-by-step illustration below.)

-- Find the 2nd highest salary
WITH RankedSalaries AS (
    SELECT
        employee_id,
        first_name,
        last_name,
        salary,
        DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
    FROM Employees
)
SELECT first_name, last_name, salary
FROM RankedSalaries
WHERE salary_rank = 2;
Step-by-step logic for N-th highest value
Let's illustrate how DENSE_RANK() helps find the N-th highest value, using the 2nd highest salary as an example.
1. Start with Data
Employees Table:
| ID | Salary |
|---|---|
| 1 | 100k |
| 2 | 90k |
| 3 | 80k |
| 4 | 80k |
| 5 | 70k |
2. Apply DENSE_RANK()
OVER (ORDER BY Salary DESC)
| ID | Salary | Dense_Rank |
|---|---|---|
| 1 | 100k | 1 |
| 2 | 90k | 2 |
| 3 | 80k | 3 |
| 4 | 80k | 3 |
| 5 | 70k | 4 |
3. Filter by Rank
WHERE Dense_Rank = 2
| ID | Salary | Dense_Rank |
|---|---|---|
| 2 | 90k | 2 |
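As promised above, the subquery alternative: the classic nested-`MAX` form finds the 2nd highest value without window functions, but it generalizes awkwardly to arbitrary N, which is why `DENSE_RANK()` is usually preferred:

-- 2nd highest salary without window functions
SELECT MAX(salary) AS second_highest_salary
FROM Employees
WHERE salary < (SELECT MAX(salary) FROM Employees);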
Cumulative Metrics
Calculating running totals or moving averages helps analyze trends over time.
- Running Totals (`SUM() OVER(ORDER BY...)`):
As seen in the Window Functions section, this is a direct application of the `SUM()` aggregate as a window function with an implicit (or explicit) frame clause.

-- Calculate a running total of order amounts by customer over time
SELECT
    customer_id,
    order_date,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total_amount
FROM Orders
ORDER BY customer_id, order_date;

- Moving Averages:
Similar to running totals, but you define a "moving window" (frame) to calculate the average over a specified number of preceding/following rows.

-- Calculate a 3-day moving average of daily sales
SELECT
    sale_date,
    daily_sales,
    AVG(daily_sales) OVER (ORDER BY sale_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS three_day_moving_avg
FROM DailySales;
Data Transformation
SQL provides many functions to clean, reformat, and derive new values from existing data.
- String manipulation (`LEFT`, `RIGHT`, `SUBSTRING`, `CONCAT`):
Essential for cleaning, parsing, and formatting text data.
SELECT
    full_name,
    LEFT(full_name, 5) AS first_5_chars,
    RIGHT(full_name, 3) AS last_3_chars,
    SUBSTRING(full_name, 7, 4) AS middle_4_chars, -- Start at the 7th character, take 4
    CONCAT(first_name, ' ', last_name) AS full_name_concat
FROM Customers;

- Date manipulation (`DATEDIFF`, `DATEADD`, extracting year/month):
Crucial for time-series analysis, age calculations, and filtering by date parts.

SELECT
    order_date,
    EXTRACT(YEAR FROM order_date) AS order_year,   -- PostgreSQL/MySQL; SQL Server uses YEAR(order_date)
    EXTRACT(MONTH FROM order_date) AS order_month,
    DATEDIFF(day, hire_date, GETDATE()) AS days_since_hire,  -- SQL Server; PostgreSQL: CURRENT_DATE - hire_date
    DATEADD(month, 3, order_date) AS three_months_later_date -- SQL Server; PostgreSQL: order_date + INTERVAL '3 months'
FROM Orders;

- Coalescing (`COALESCE` to handle NULLs):
`COALESCE()` returns the first non-`NULL` expression in a list. It's excellent for providing default values when a column might be `NULL`.

-- Display phone_number, but if NULL, display 'N/A'
SELECT
    employee_id,
    first_name,
    COALESCE(phone_number, 'N/A') AS contact_phone
FROM Employees;
Practice & Application
🎯 Practice: Identifying and Cleaning Duplicate Customer Emails
You have a Customers table with customer_id, first_name, last_name, and email.
- First, find all `email` addresses that appear more than once in the table.
- Then, write a query to delete all but one instance of each duplicate customer, keeping the row with the lowest `customer_id` for each unique email address.
Click for Solution
This problem demonstrates the crucial steps for duplicate handling: first identifying them, and then selectively deleting them while preserving data integrity.
-- Assume Customers table:
-- (customer_id INT PRIMARY KEY, first_name VARCHAR, last_name VARCHAR, email VARCHAR)
-- Sample data could have:
-- (1, 'Alice', 'Smith', 'alice@example.com')
-- (2, 'Bob', 'Jones', 'bob@example.com')
-- (3, 'Charlie', 'Brown', 'alice@example.com') -- Duplicate email for Alice
-- (4, 'Diana', 'Prince', 'diana@example.com')
-- (5, 'Eve', 'Green', 'bob@example.com') -- Duplicate email for Bob
-- 1. Find all duplicate email addresses
SELECT
email,
COUNT(email) AS duplicate_count
FROM
Customers
GROUP BY
email
HAVING
COUNT(email) > 1;
Explanation for Part 1:
- `GROUP BY email` groups all rows that share the same email address.
- `HAVING COUNT(email) > 1` filters these groups, showing only those email addresses that appear more than once.
-- 2. Delete duplicate customers, keeping the one with the lowest customer_id
WITH RankedCustomers AS (
SELECT
customer_id,
email,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY customer_id ASC) AS rn
FROM
Customers
)
DELETE FROM Customers
WHERE customer_id IN (
SELECT customer_id FROM RankedCustomers WHERE rn > 1
);
-- For SQL Server, a more direct DELETE from CTE might be used:
-- DELETE RC FROM RankedCustomers RC WHERE RC.rn > 1;
Explanation for Part 2:
- The `RankedCustomers` CTE assigns a `ROW_NUMBER` to each customer, partitioned by `email`. For each unique email, the `ROW_NUMBER` starts at 1, and `ORDER BY customer_id ASC` ensures that the lowest `customer_id` for a given email gets `rn = 1`.
- The outer `DELETE` statement then targets all `customer_id`s from the `RankedCustomers` CTE where `rn > 1`. These are the "extra" duplicate records that we want to remove, leaving only the one with `rn = 1`.
🎯 Practice: Finding the 3rd Highest Selling Product per Category
You have Products (product_id, product_name, category_id) and OrderItems (order_item_id, product_id, quantity, price_per_unit).
Find the product_name, category_id, and total sales revenue for the 3rd highest-selling product in each category. If a category has fewer than three distinct revenue ranks, it will simply not appear in the results.
Click for Solution
This problem combines aggregation, joins, and the DENSE_RANK() window function within a CTE to perform a "Top N per Group" analysis.
-- Assume Products and OrderItems tables exist.
-- Sample data:
-- Products: (1, 'Laptop', 101), (2, 'Mouse', 101), (3, 'Keyboard', 101), (4, 'Monitor', 101), (5, 'Desk', 102), (6, 'Chair', 102)
-- OrderItems: (OI1, 1, 5, 1000), (OI2, 1, 2, 1000), (OI3, 2, 10, 20), (OI4, 3, 8, 50), (OI5, 4, 3, 200), (OI6, 5, 2, 150), (OI7, 5, 1, 150), (OI8, 6, 4, 80)
WITH ProductSales AS (
SELECT
P.product_id,
P.product_name,
P.category_id,
SUM(OI.quantity * OI.price_per_unit) AS total_revenue
FROM
Products P
JOIN
OrderItems OI ON P.product_id = OI.product_id
GROUP BY
P.product_id, P.product_name, P.category_id
),
RankedProductSales AS (
SELECT
product_id,
product_name,
category_id,
total_revenue,
DENSE_RANK() OVER (PARTITION BY category_id ORDER BY total_revenue DESC) AS rank_in_category
FROM
ProductSales
)
SELECT
product_name,
category_id,
total_revenue
FROM
RankedProductSales
WHERE
rank_in_category = 3
ORDER BY
category_id, total_revenue DESC;
Explanation:
- `ProductSales` CTE: First, we calculate the `total_revenue` for each product by joining `Products` and `OrderItems` and aggregating.
- `RankedProductSales` CTE: We then apply `DENSE_RANK()` to this aggregated data. `PARTITION BY category_id` ensures ranking happens independently within each category, and `ORDER BY total_revenue DESC` ranks products by their sales. `DENSE_RANK()` is chosen so that if multiple products tie for, say, 2nd place, the next distinct rank is 3rd (not 4th, like `RANK()`).
- Final `SELECT`: We select the desired columns from `RankedProductSales` and filter `WHERE rank_in_category = 3` to get only the 3rd highest-selling products.
🎯 Practice: Calculating Monthly Growth Percentage
You have a DailySales table with sale_date and total_sales_amount. Calculate the total monthly sales for each month and, for each month, determine the percentage growth or decline compared to the previous month. Assume data exists for multiple consecutive months.
Click for Solution
This problem involves multiple steps: aggregating daily sales to monthly, using `LAG()` to get the previous month's sales, and then calculating the percentage change. This is a common time-series analysis pattern.
-- Assume DailySales table: (sale_date DATE, total_sales_amount DECIMAL)
WITH MonthlySales AS (
SELECT
DATE_TRUNC('month', sale_date) AS sales_month, -- PostgreSQL; MySQL: DATE_FORMAT(sale_date, '%Y-%m-01'); SQL Server: DATEFROMPARTS(YEAR(sale_date), MONTH(sale_date), 1)
SUM(total_sales_amount) AS monthly_total_sales
FROM
DailySales
GROUP BY
DATE_TRUNC('month', sale_date)
),
LaggedMonthlySales AS (
SELECT
sales_month,
monthly_total_sales,
LAG(monthly_total_sales, 1) OVER (ORDER BY sales_month ASC) AS previous_month_sales
FROM
MonthlySales
)
SELECT
sales_month,
monthly_total_sales,
previous_month_sales,
CASE
WHEN previous_month_sales IS NULL THEN NULL -- No previous month to compare
WHEN previous_month_sales = 0 THEN NULL -- Avoid division by zero if previous month had 0 sales
ELSE (monthly_total_sales - previous_month_sales) * 100.0 / previous_month_sales
END AS growth_percentage
FROM
LaggedMonthlySales
ORDER BY
sales_month ASC;
Explanation:
- `MonthlySales` CTE: Aggregates daily sales into monthly totals. `DATE_TRUNC('month', sale_date)` (or the equivalent in your RDBMS) truncates the date to the first day of the month, allowing for grouping by month.
- `LaggedMonthlySales` CTE: Uses the `LAG()` window function to fetch the `monthly_total_sales` from the *previous* month. The `ORDER BY sales_month ASC` is crucial for `LAG()` to correctly identify the preceding period.
- Final `SELECT`: Calculates the `growth_percentage`. A `CASE` statement is used to handle `NULL` for the first month (no previous month to compare) and to prevent division by zero if a previous month had zero sales.
🎯 Practice: Formatting User Names and Contact Info
You have a Users table with user_id, first_name, last_name, email, and an optional phone_number.
- Create a new column called `full_name` in "LastName, FirstName" format.
- Extract the domain from each `email` address (e.g., "example.com" from "user@example.com").
- Display the `phone_number`, but if it's `NULL`, show "Contact Email" instead.
Click for Solution
This problem focuses on practical data transformation using string and `NULL` handling functions, commonly needed for reporting or display purposes.
-- Assume Users table: (user_id, first_name, last_name, email, phone_number)
SELECT
user_id,
CONCAT(last_name, ', ', first_name) AS full_name, -- 1. Concatenate for full name
SUBSTRING(email FROM POSITION('@' IN email) + 1) AS email_domain, -- 2. Extract domain (PostgreSQL/MySQL)
-- For SQL Server: SUBSTRING(email, CHARINDEX('@', email) + 1, LEN(email)) AS email_domain,
COALESCE(phone_number, 'Contact Email') AS preferred_contact -- 3. Handle NULL phone number
FROM
Users
ORDER BY
full_name;
Explanation:
- `CONCAT(last_name, ', ', first_name)`: Joins the last name, a comma and space, and the first name to create the desired `full_name` format.
- `SUBSTRING(email FROM POSITION('@' IN email) + 1)`: `POSITION('@' IN email)` finds the position of the '@' symbol; adding `+ 1` moves to the character right after it. `SUBSTRING(email FROM start_position)` then extracts the rest of the string from that position, giving us the domain. (Syntax varies by RDBMS; e.g., SQL Server uses `CHARINDEX` and `LEN`.)
- `COALESCE(phone_number, 'Contact Email')`: Returns the first non-`NULL` expression. If `phone_number` is `NULL`, it returns 'Contact Email'; otherwise, it returns the actual `phone_number`.
14. Advanced Topics & ACID
Moving beyond basic querying and data manipulation, understanding transactions, concurrency, and database design principles (normalization) is crucial for building robust, reliable, and performant database systems. These topics are often explored in more advanced SQL interview questions.
Transactions & Concurrency
A transaction is a single logical unit of work that accesses and possibly modifies the contents of a database. Transactions are critical for maintaining data integrity, especially in multi-user environments.
ACID Properties: The Pillars of Reliability
The reliability of database transactions is described by the ACID properties:
⚛️ Atomicity (All or Nothing)
A transaction is treated as a single, indivisible unit. Either all of its changes are committed (applied to the database), or none of them are. If any part of the transaction fails, the entire transaction is rolled back, and the database returns to its state before the transaction began.
Example: A bank transfer. Either money is successfully debited from Account A AND credited to Account B, or neither happens.
✅ Consistency (Valid State)
A transaction brings the database from one valid state to another. It ensures that any data written to the database must be valid according to all defined rules (constraints, triggers, cascades). If a transaction attempts to violate these rules, it is rolled back.
Example: An `INSERT` operation must adhere to `NOT NULL`, `UNIQUE`, `FOREIGN KEY` constraints. If not, the transaction fails, preserving consistency.
🔒 Isolation (Transaction Visibility)
Concurrent transactions execute in such a way that the intermediate states of one transaction are not visible to other concurrent transactions. This gives the illusion that transactions are running serially, even if they are executing in parallel.
Example: If User A is updating a record, User B should not see the half-updated (inconsistent) state of that record until User A's transaction is committed.
💾 Durability (Saved Permanently)
Once a transaction has been committed, its changes are permanent and will persist even in the event of system failures (power loss, crashes). The database management system ensures that committed data is written to non-volatile storage.
Example: After a successful credit card payment is committed, even if the server immediately crashes, the payment record will still be there when the system recovers.
Diagram of a Transaction Lifecycle (Begin → Commit/Rollback)
BEGIN TRANSACTION
(Start a logical unit of work)
Execute SQL Statements
(DML: INSERT, UPDATE, DELETE)
COMMIT
(Save changes permanently)
ROLLBACK
(Undo all changes)
-- Example of a transaction
BEGIN TRANSACTION; -- Start the transaction
UPDATE Accounts SET balance = balance - 100 WHERE account_id = 'A123';
INSERT INTO Transactions (from_account, to_account, amount) VALUES ('A123', 'B456', 100);
UPDATE Accounts SET balance = balance + 100 WHERE account_id = 'B456';
-- Check if everything went well (e.g., no negative balances, etc.)
-- IF conditions_met THEN
COMMIT; -- Save changes to the database
-- ELSE
ROLLBACK; -- Undo all changes made since BEGIN TRANSACTION
-- END IF;
Normalization (Database Design)
Normalization is a systematic approach to designing relational database schemas to minimize data redundancy and improve data integrity. It involves breaking down a large table into smaller, related tables and defining relationships between them.
- 1NF (First Normal Form): Atomic values
- Each column must contain atomic (indivisible) values. No multi-valued attributes or repeating groups.
- Each row must be unique (have a primary key).
Violation Example: A 'phone_numbers' column containing "555-1234, 555-5678" for a single person. - 2NF (Second Normal Form): No partial dependencies
- Must be in 1NF.
- All non-key attributes must be fully functionally dependent on the entire primary key. (This applies only to tables with composite primary keys).
Violation Example: If `(order_id, product_id)` is a composite PK, and `product_name` depends only on `product_id` (a *part* of the PK), not on `order_id`. This should be in a separate `Products` table. - 3NF (Third Normal Form): No transitive dependencies
- Must be in 2NF.
- All non-key attributes must be non-transitively dependent on the primary key. This means no non-key attribute should depend on another non-key attribute.
Violation Example: In an `Employees` table, if `department_name` depends on `department_id`, and `department_id` depends on `employee_id`. Here, `department_name` transitively depends on `employee_id` via `department_id`. `department_name` should be in a separate `Departments` table. - Denormalization (When to break rules for performance):
Sometimes, for specific performance reasons (e.g., complex reporting, reducing joins on read-heavy systems), you might intentionally introduce redundancy into your database design. This is called denormalization.
- ✅ Pros: Faster read queries (fewer joins), simplified queries.
- ❌ Cons: Increased data redundancy, higher risk of data inconsistency, slower write operations (due to updating redundant data).
Denormalization is a conscious trade-off made after careful performance analysis, not a default design choice.
Practice & Application
🎯 Practice: Simulating an Atomic Bank Transfer (ACID)
You need to transfer $100 from Account A (ID 1) to Account B (ID 2). Write a SQL transaction that ensures either both the debit from A and the credit to B complete successfully, or neither does. If Account A has insufficient funds (less than $100), the entire transfer should be aborted.
Click for Solution
This problem demonstrates the Atomicity and Consistency properties of ACID. The `BEGIN TRANSACTION`, `COMMIT`, and `ROLLBACK` commands are used to ensure the transfer is a single, all-or-nothing operation, and that account balances remain consistent.
-- Assume Accounts table: (account_id INT PRIMARY KEY, balance DECIMAL(10, 2))
-- Initial state: Account 1 has $500, Account 2 has $100.
-- Let's test with Account 1 having $50 for the ROLLBACK case after running the success case.
-- T-SQL (SQL Server) shown; variable and control-flow syntax varies by RDBMS
-- (e.g., PostgreSQL would wrap this logic in a DO block or a PL/pgSQL function).
BEGIN TRANSACTION;

-- Check the balance of Account 1
DECLARE @account_A_balance DECIMAL(10, 2);
SELECT @account_A_balance = balance FROM Accounts WHERE account_id = 1;

IF @account_A_balance >= 100
BEGIN
    -- Debit Account 1
    UPDATE Accounts
    SET balance = balance - 100
    WHERE account_id = 1;

    -- Credit Account 2
    UPDATE Accounts
    SET balance = balance + 100
    WHERE account_id = 2;

    -- All operations successful, commit the transaction
    COMMIT;
    PRINT 'Transfer successful!';
END
ELSE
BEGIN
    -- Insufficient funds, roll back the transaction
    ROLLBACK;
    PRINT 'Transfer failed: Insufficient funds in Account 1.';
END
Explanation:
- `BEGIN TRANSACTION` marks the start of the atomic unit of work.
- The `IF` condition checks for sufficient funds.
- If funds are sufficient, both `UPDATE` statements are executed. If either fails for any reason (e.g., constraint violation, system error), the subsequent `ROLLBACK` (or an implicit rollback due to error) would undo all changes.
- `COMMIT` makes all changes permanent, satisfying Durability.
- If funds are insufficient, `ROLLBACK` is explicitly called, ensuring that no changes are applied to the database, maintaining Atomicity (all or nothing) and Consistency (no account is partially debited/credited).
🎯 Practice: Preventing a Dirty Read (ACID - Isolation)
Describe a scenario where two concurrent transactions could lead to a "Dirty Read" problem. Explain how the default isolation level in many databases prevents this, upholding the Isolation property.
Click for Solution
This example focuses on the conceptual understanding of Isolation and common concurrency issues.
Scenario: Dirty Read
- Transaction A starts: A user (e.g., HR) starts updating an employee's salary from $50,000 to $60,000 but has not yet committed the change.
- Transaction B starts concurrently: Another user (e.g., Payroll) queries the same employee's salary to calculate their next paycheck.
- The "Dirty Read": If Transaction B reads the salary before Transaction A commits, it might see the $60,000 value.
- Problem: If Transaction A then encounters an error and rolls back its change (salary reverts to $50,000), Transaction B has made a calculation based on data that never actually existed in the database. This is a "Dirty Read" because Transaction B read uncommitted data.
How Isolation Prevents It (Read Committed)
Most modern relational databases default to an isolation level of Read Committed (or stricter). This level ensures that:
- A transaction can only read data that has been committed by other transactions.
- When Transaction A updates the employee's salary to $60,000 but hasn't committed, a lock is typically held on that row (or versioning is used).
- If Transaction B attempts to read that employee's salary, the database will either wait for Transaction A to commit (or rollback) or will return the *last committed version* of the data (the $50,000 salary), depending on the specific implementation (locking vs. MVCC - Multi-Version Concurrency Control).
- This ensures that Transaction B never sees the intermediate, uncommitted state of Transaction A, thus preventing the Dirty Read and upholding the Isolation property. The two-session sketch below walks through this timeline.
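As a concrete illustration of the timeline above (the `employee_id` and salary values are hypothetical):

-- Session A (HR): begins an update but has not committed yet
BEGIN TRANSACTION;
UPDATE Employees SET salary = 60000 WHERE employee_id = 42;
-- ... Transaction A is still open here ...

-- Session B (Payroll), running concurrently:
SET TRANSACTION ISOLATION LEVEL READ COMMITTED; -- the default in SQL Server and PostgreSQL
SELECT salary FROM Employees WHERE employee_id = 42;
-- Under READ COMMITTED, this either waits for Session A's lock to clear
-- or (under MVCC) returns the last committed value, 50000 -- never 60000.

-- Session A hits an error and aborts:
ROLLBACK; -- salary is 50000 again; Session B never saw uncommitted data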
🎯 Practice: Normalizing a Table to 1NF
You have a table named Customers with the following structure and sample data:
CREATE TABLE Customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100),
customer_address VARCHAR(255),
phone_numbers VARCHAR(255) -- Stores multiple phone numbers like "555-1234, 555-5678"
);
INSERT INTO Customers VALUES (1, 'Alice', '123 Main St', '555-1234, 555-5678');
INSERT INTO Customers VALUES (2, 'Bob', '456 Oak Ave', '555-9999');
Identify why this table violates First Normal Form (1NF) and propose a normalized 1NF schema with SQL `CREATE TABLE` statements for the corrected structure.
Click for Solution
This problem highlights the concept of atomic values in 1NF, which prohibits multi-valued attributes within a single column.
1NF Violation:
The phone_numbers column violates 1NF because it stores multiple values (a comma-separated list of phone numbers) within a single cell. Each column should contain only atomic (indivisible) values.
Normalized 1NF Schema:
To achieve 1NF, we split the multi-valued `phone_numbers` into a separate, related table, giving each phone number its own row.
-- Original Customers table remains (without phone_numbers column or just with a single main_phone column)
CREATE TABLE Customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100),
customer_address VARCHAR(255)
);
-- New table for phone numbers
CREATE TABLE CustomerPhones (
phone_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- Surrogate PK for phone number entry
customer_id INT NOT NULL,
phone_number VARCHAR(20) NOT NULL,
CONSTRAINT FK_CustomerPhone FOREIGN KEY (customer_id) REFERENCES Customers(customer_id),
CONSTRAINT UQ_CustomerPhoneNumber UNIQUE (customer_id, phone_number) -- A customer can't have the same phone number twice
);
-- Sample Data after normalization:
-- Customers:
-- (1, 'Alice', '123 Main St')
-- (2, 'Bob', '456 Oak Ave')
-- CustomerPhones:
-- (101, 1, '555-1234')
-- (102, 1, '555-5678')
-- (103, 2, '555-9999')
Explanation:
- The original `Customers` table no longer stores multiple phone numbers in one field.
- A new table, `CustomerPhones`, is created. Each row in `CustomerPhones` now represents a single, atomic phone number.
- `customer_id` acts as a foreign key, linking each phone number back to its respective customer.
- A composite `UNIQUE` constraint `(customer_id, phone_number)` ensures a customer doesn't have the exact same phone number listed multiple times.
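If reports still need the old comma-separated display, it can be rebuilt at query time. A minimal sketch, assuming an RDBMS with `STRING_AGG` (PostgreSQL, SQL Server 2017+); MySQL would use `GROUP_CONCAT` instead:

-- Rebuild the "555-1234, 555-5678" display from the normalized tables
SELECT
    C.customer_id,
    C.customer_name,
    STRING_AGG(CP.phone_number, ', ') AS phone_numbers
FROM Customers C
LEFT JOIN CustomerPhones CP ON CP.customer_id = C.customer_id
GROUP BY C.customer_id, C.customer_name;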
🎯 Practice: Normalizing a Table to 2NF
You have an OrderDetails table with a composite primary key (order_id, product_id):
CREATE TABLE OrderDetails (
order_id INT,
product_id INT,
product_name VARCHAR(100), -- Product name depends only on product_id
quantity INT,
price_per_unit DECIMAL(10, 2),
PRIMARY KEY (order_id, product_id)
);
INSERT INTO OrderDetails VALUES (101, 1, 'Laptop', 1, 1200.00);
INSERT INTO OrderDetails VALUES (101, 2, 'Mouse', 2, 25.00);
INSERT INTO OrderDetails VALUES (102, 1, 'Laptop', 1, 1200.00);
Identify why this table violates Second Normal Form (2NF) and propose a normalized 2NF schema with SQL `CREATE TABLE` statements.
Click for Solution
This problem addresses 2NF, which deals with partial dependencies when a table has a composite primary key.
2NF Violation:
The table is in 1NF (all values are atomic). However, it violates 2NF because the non-key attribute `product_name` is partially dependent on the primary key. `product_name` depends only on `product_id` (a part of the composite primary key), not on the full `(order_id, product_id)` combination.
This leads to redundancy: `Laptop` and its price are repeated for every order that includes `product_id = 1`.
Normalized 2NF Schema:
To achieve 2NF, we decompose the table into two tables: one for order item details and one for product details.
-- Original OrderDetails table modified to store only order-specific item details
CREATE TABLE OrderItems (
order_id INT,
product_id INT,
quantity INT,
price_per_unit DECIMAL(10, 2), -- This price might vary per order, so keep it here
PRIMARY KEY (order_id, product_id),
CONSTRAINT FK_OrderItemOrder FOREIGN KEY (order_id) REFERENCES Orders(order_id), -- Assuming an Orders table exists
CONSTRAINT FK_OrderItemProduct FOREIGN KEY (product_id) REFERENCES Products(product_id) -- References the new Products table
);
-- New Products table for product-specific information
CREATE TABLE Products (
product_id INT PRIMARY KEY,
product_name VARCHAR(100) NOT NULL UNIQUE
);
-- Sample Data after normalization:
-- OrderItems:
-- (101, 1, 1, 1200.00)
-- (101, 2, 2, 25.00)
-- (102, 1, 1, 1200.00)
-- Products:
-- (1, 'Laptop')
-- (2, 'Mouse')
Explanation:
- The `Products` table now stores product-specific attributes (`product_name`) that depend solely on `product_id`.
- The `OrderItems` table maintains the composite primary key `(order_id, product_id)` and stores attributes that fully depend on this key (`quantity`, `price_per_unit`).
- `product_id` in `OrderItems` acts as a foreign key to `Products`, linking the two tables. This eliminates the redundancy of `product_name` and `product_price` in the `OrderItems` table.
🎯 Practice: Normalizing a Table to 3NF
You have an Employees table (in 2NF) with the following structure and sample data:
CREATE TABLE Employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100),
department_id INT,
department_name VARCHAR(100) -- Department name depends on department_id
);
INSERT INTO Employees VALUES (1, 'Alice', 10, 'Sales');
INSERT INTO Employees VALUES (2, 'Bob', 20, 'Marketing');
INSERT INTO Employees VALUES (3, 'Charlie', 10, 'Sales');
Identify why this table violates Third Normal Form (3NF) and propose a normalized 3NF schema with SQL `CREATE TABLE` statements.
Click for Solution
This problem addresses 3NF, which deals with transitive dependencies, where a non-key attribute depends on another non-key attribute.
3NF Violation:
The table is in 2NF. However, it violates 3NF because `department_name` (a non-key attribute) is transitively dependent on `employee_id` (the primary key) through `department_id` (another non-key attribute). In other words, `department_name` depends on `department_id`, which in turn depends on `employee_id`.
This leads to redundancy: `Sales` and `Marketing` department names are repeated for every employee in those departments.
Normalized 3NF Schema:
To achieve 3NF, we extract the `department_id` and `department_name` into a separate `Departments` table.
-- Employees table
CREATE TABLE Employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100),
department_id INT,
CONSTRAINT FK_EmployeeDepartment FOREIGN KEY (department_id) REFERENCES Departments(department_id)
);
-- New Departments table
CREATE TABLE Departments (
department_id INT PRIMARY KEY,
department_name VARCHAR(100) NOT NULL UNIQUE
);
-- Sample Data after normalization:
-- Employees:
-- (1, 'Alice', 10)
-- (2, 'Bob', 20)
-- (3, 'Charlie', 10)
-- Departments:
-- (10, 'Sales')
-- (20, 'Marketing')
Explanation:
- The `Departments` table now stores department-specific attributes (`department_name`) that depend solely on `department_id`.
- The `Employees` table retains `employee_id` (PK), `employee_name`, and `department_id`.
- `department_id` in `Employees` acts as a foreign key to `Departments`, linking employees to their departments.
- This eliminates the redundancy of department names and ensures that if a department name changes, it only needs to be updated in one place.
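If existing queries or reports expect the old "wide" shape, a view over the normalized tables can preserve it without reintroducing redundancy; `EmployeeDirectory` is a hypothetical name for illustration:

-- A view re-joins the 3NF tables so readers still see department_name per employee
CREATE VIEW EmployeeDirectory AS
SELECT
    E.employee_id,
    E.employee_name,
    E.department_id,
    D.department_name
FROM Employees E
JOIN Departments D ON D.department_id = E.department_id;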
🎯 Practice: Denormalization for Reporting Performance
You have a fully normalized database with Customers (customer_id, customer_name), Orders (order_id, customer_id, order_date), and Products (product_id, product_name). For a daily sales report, the business frequently needs to see customer_name, order_date, and product_name for every item sold.
While this can be done with joins, explain a scenario where you might consider denormalizing this data into a dedicated reporting table, and what the trade-offs would be.
Click for Solution
This problem explores the practical reasons for denormalization, a common strategy in data warehousing and OLAP systems where read performance outweighs the strict adherence to normalization rules.
Scenario for Denormalization:
For a highly frequented daily sales report, users constantly query customer names, order dates, and product names. In a fully normalized schema, this would require joining at least three tables (`Customers`, `Orders`, `OrderItems` (which links to `Products`)). For millions of orders, these joins can become computationally expensive and slow for repeated queries.
A common denormalization strategy here would be to create a "fact table" or a "summary table" in a data warehouse context. For instance, a `SalesFact` table.
Proposed Denormalized Structure (Conceptual `SalesFact` Table):
Instead of linking via foreign keys and performing joins on the fly, the `SalesFact` table could directly include redundant `customer_name` and `product_name` columns:
-- Denormalized table for reporting (created/refreshed daily)
CREATE TABLE SalesFact_Denormalized (
sales_fact_id INT PRIMARY KEY,
order_id INT,
order_date DATE,
customer_id INT,
customer_name VARCHAR(100), -- Redundant: Copied from Customers
product_id INT,
product_name VARCHAR(100), -- Redundant: Copied from Products
quantity INT,
total_item_price DECIMAL(10, 2)
);
-- Data for this table would typically be populated via ETL processes
-- (Extract, Transform, Load) from the normalized OLTP system.
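To make that ETL step concrete, here is a minimal full-refresh sketch. It assumes the `OrderItems` table used elsewhere in this guide (`order_id`, `product_id`, `quantity`, `price_per_unit`); a production pipeline would more likely load incrementally:

-- Full refresh: wipe and repopulate the reporting table from the normalized source
TRUNCATE TABLE SalesFact_Denormalized;

INSERT INTO SalesFact_Denormalized
    (sales_fact_id, order_id, order_date, customer_id, customer_name,
     product_id, product_name, quantity, total_item_price)
SELECT
    ROW_NUMBER() OVER (ORDER BY O.order_id, OI.product_id) AS sales_fact_id, -- surrogate key for this load
    O.order_id,
    O.order_date,
    C.customer_id,
    C.customer_name,                          -- denormalized copy from Customers
    P.product_id,
    P.product_name,                           -- denormalized copy from Products
    OI.quantity,
    OI.quantity * OI.price_per_unit AS total_item_price
FROM Orders O
JOIN Customers C ON C.customer_id = O.customer_id
JOIN OrderItems OI ON OI.order_id = O.order_id
JOIN Products P ON P.product_id = OI.product_id;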
Trade-offs:
- ✅ Pros of Denormalization:
- Faster Read Performance: Queries for the daily sales report would involve reading from a single, wider table, eliminating complex joins and drastically improving query speed.
- Simpler Queries: Reporting queries become much simpler, requiring fewer joins and less complex logic.
- Pre-aggregated Data: Can also pre-aggregate common metrics (e.g., daily sales by product, by customer) to serve frequently requested reports even faster.
- ❌ Cons of Denormalization:
- Increased Data Redundancy: `customer_name` and `product_name` are repeated for every order item, consuming more storage.
- Higher Risk of Data Inconsistency: If a customer's name changes in the `Customers` table, the `SalesFact_Denormalized` table might become outdated until it's refreshed. More effort is needed to keep redundant data synchronized.
- Slower Write/Update Performance: If `SalesFact_Denormalized` were directly updated in real-time, any changes to a customer's or product's core data would require updates in multiple places. (Typically, this table is rebuilt/refreshed, not updated row-by-row).
- Increased ETL Complexity: A separate ETL (Extract, Transform, Load) process is needed to populate and maintain the denormalized table from the normalized operational database.
Conclusion: Denormalization is a strategic decision, often made in analytical or reporting databases (OLAP systems) where read-heavy workloads demand speed, and the operational database (OLTP system) maintains the normalized, consistent source of truth.
15. Exam & Interview Strategy Guide
You've covered a comprehensive range of SQL topics, from fundamentals to advanced concepts. Now, let's equip you with a robust strategy for approaching SQL interview questions and technical exams confidently.
Problem Solving Framework
Adopt a structured approach to tackle any SQL problem. This framework helps break down complex tasks into manageable steps.
1. Understand Inputs/Outputs
What tables, columns, and sample data are you given? What is the exact expected output (columns, aggregation level, filters)? Clarify any ambiguities!
2. Identify the Grain of the Data
What does each row in your *final* output represent? Are you returning individual records, or aggregated summaries (e.g., one row per customer, one row per day)? This dictates `GROUP BY` strategy.
3. Write the Skeleton Query
Start with the fundamental structure. Which tables do you need? What joins? What's the initial `SELECT` list? Don't worry about complexity yet.
SELECT
-- ... columns ...
FROM
Table1
JOIN
Table2 ON ...
WHERE
-- ... basic filters ...
GROUP BY
-- ... grouping columns ...
HAVING
-- ... group filters ...
ORDER BY
-- ... sorting ...
4. Fill in the Logic
Incrementally add `WHERE` conditions, `CASE` statements, window functions, and more complex logic. Test each step if possible. Use CTEs to break down intricate logic.
Debugging Your Own Code
Even the best developers write bugs. Knowing how to find and fix them efficiently is a valuable skill.
- 🛠️ Check Intermediate Results with CTEs: If your query is long or complex, use CTEs to isolate different parts of the logic. Select from each CTE individually to verify that it's producing the expected intermediate data. This is far easier than trying to debug a single, monolithic query. (A sketch after this list shows the workflow.)
- 🛠️ Verify Row Counts Before/After Joins: The "fan-out" problem (Section 7) is a common source of errors. Before and after each `JOIN` or `UNION`, check the `COUNT(*)` to ensure your row count changes as expected. Unexpected increases often point to incorrect join conditions or implicit Cartesian products.
- 🛠️ Simplify, then Rebuild: If a query is not working, remove all non-essential clauses (`ORDER BY`, complex `WHERE` conditions, extra columns). Get the simplest core query working, then gradually add back complexity, testing at each step.
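To make the first two checks concrete, here is a minimal sketch using the `Orders` and `Customers` tables from earlier examples:

-- A report query broken into inspectable stages
WITH FilteredOrders AS (
    SELECT * FROM Orders WHERE order_date >= '2023-01-01'
),
JoinedData AS (
    SELECT FO.order_id, FO.amount, C.customer_name
    FROM FilteredOrders FO
    JOIN Customers C ON C.customer_id = FO.customer_id
)
-- Point the final SELECT at any stage to verify it:
SELECT COUNT(*) FROM FilteredOrders;  -- row count after filtering, before the join
-- SELECT COUNT(*) FROM JoinedData;   -- same count expected; an increase signals fan-out
-- SELECT * FROM JoinedData;          -- spot-check the joined rows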
Review Checklist
Before submitting your solution or declaring it "done," run through these critical checks:
- ✅ Did I handle NULLs? This is a common pitfall. Are `NULL`s correctly filtered (`IS NULL`/`IS NOT NULL`), sorted (`NULLS FIRST`/`NULLS LAST`), or replaced (`COALESCE`)?
- ✅ Is the syntax dialect-specific? Be aware of differences between T-SQL (SQL Server), PostgreSQL, MySQL, Oracle SQL. Functions like `DATE_TRUNC`, `DATEDIFF`, `CONCAT`, and auto-incrementing primary keys vary. State your assumed dialect if not specified.
- ✅ Is the query efficient? For interviews, mentioning potential optimizations (indexes, avoiding correlated subqueries, using `EXISTS` instead of `IN` for subqueries if `NULL`s are an issue; see the sketch after this checklist) can impress.
- ✅ Is the code readable? Use aliases, consistent capitalization, and proper indentation. Comments can explain complex logic.
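On the `NULL` pitfall mentioned in the efficiency check above, a minimal sketch (assuming `Orders.customer_id` can be `NULL`):

-- If any Orders.customer_id is NULL, NOT IN returns no rows at all,
-- because `x NOT IN (..., NULL)` evaluates to UNKNOWN for every x.
SELECT C.customer_id
FROM Customers C
WHERE C.customer_id NOT IN (SELECT O.customer_id FROM Orders O);

-- NOT EXISTS is NULL-safe and returns customers with no orders, as intended.
SELECT C.customer_id
FROM Customers C
WHERE NOT EXISTS (
    SELECT 1 FROM Orders O WHERE O.customer_id = C.customer_id
);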
Resources
Practice is paramount! Here are some recommended platforms for honing your SQL skills:
- 📚 LeetCode: Excellent for algorithmic SQL problems, covers a wide range of difficulty.
- 📚 HackerRank: Good for beginner to intermediate SQL challenges.
- 📚 StrataScratch: Focuses on real-world data science interview questions, often from top tech companies.
- 📚 SQLZoo: Interactive tutorials and exercises, great for learning basics.
- 📚 Kaggle Notebooks: Explore public datasets and see how others solve problems with SQL.
Homework / Challenges
🎯 Challenge: High-Value Customer Segmentation
You have access to three tables:
- `Customers`: (`customer_id` (PK), `customer_name`, `registration_date`)
- `Orders`: (`order_id` (PK), `customer_id` (FK), `order_date`, `total_amount`)
- `OrderItems`: (`order_item_id` (PK), `order_id` (FK), `product_id` (FK), `quantity`)
Your task is to identify "High-Value Customers". A customer is considered "High-Value" if they meet both of these criteria:
- Their total lifetime spending is greater than the overall average total spending per customer.
- They have placed a total number of orders greater than the overall average number of orders per customer.
For each High-Value customer, return their customer_name, their total lifetime spending, total number of orders, and the count of distinct products they've purchased.
Click to see Solution
Step 1: Calculate Customer Lifetime Value (Total Spending, Total Orders, Distinct Products)
First, we need to get the lifetime metrics for each customer. This involves joining Customers with Orders and OrderItems, then grouping by customer to aggregate their spending, order count, and unique product purchases.
WITH CustomerLifetimeMetrics AS (
SELECT
C.customer_id,
C.customer_name,
SUM(O.total_amount) AS total_spending,
COUNT(DISTINCT O.order_id) AS total_orders,
COUNT(DISTINCT OI.product_id) AS distinct_products_purchased
FROM
Customers C
JOIN
Orders O ON C.customer_id = O.customer_id
JOIN
OrderItems OI ON O.order_id = OI.order_id
GROUP BY
C.customer_id, C.customer_name
)
SELECT * FROM CustomerLifetimeMetrics; -- Intermediate result check
Step 2: Calculate Overall Average Spending and Order Count per Customer
Next, we need the company-wide averages for total spending and total orders per customer. These will be scalar values that we can use for comparison.
-- This could be another CTE or derived table
WITH CustomerLifetimeMetrics AS (
-- ... (Same as Step 1)
),
OverallAverages AS (
SELECT
AVG(total_spending) AS avg_customer_spending,
AVG(total_orders) AS avg_customer_orders
FROM
CustomerLifetimeMetrics
)
SELECT * FROM OverallAverages; -- Intermediate result check
Step 3: Combine and Filter for High-Value Customers
Finally, we combine these CTEs. We'll select from CustomerLifetimeMetrics and join it (conceptually, or cross join if overall averages are pre-calculated) or simply use the overall averages in the WHERE clause as scalar subqueries. We apply the "High-Value" criteria in the WHERE clause.
WITH CustomerLifetimeMetrics AS (
SELECT
C.customer_id,
C.customer_name,
SUM(O.total_amount) AS total_spending,
COUNT(DISTINCT O.order_id) AS total_orders,
COUNT(DISTINCT OI.product_id) AS distinct_products_purchased
FROM
Customers C
JOIN
Orders O ON C.customer_id = O.customer_id
JOIN
OrderItems OI ON O.order_id = OI.order_id
GROUP BY
C.customer_id, C.customer_name
),
OverallAverages AS (
SELECT
AVG(total_spending) AS avg_customer_spending,
AVG(total_orders) AS avg_customer_orders
FROM
CustomerLifetimeMetrics
)
SELECT
CLM.customer_name,
CLM.total_spending,
CLM.total_orders,
CLM.distinct_products_purchased
FROM
CustomerLifetimeMetrics CLM CROSS JOIN OverallAverages OA -- OverallAverages is a single row, so the CROSS JOIN attaches the averages to every customer row
WHERE
CLM.total_spending > OA.avg_customer_spending
AND CLM.total_orders > OA.avg_customer_orders
ORDER BY
CLM.total_spending DESC;
Conclusion: This solution effectively segments customers by first computing individual customer metrics, then calculating company-wide averages, and finally filtering based on those averages. The use of CTEs makes the multi-step logic clear and readable.
🎯 Challenge: Longest Consecutive Daily Login Streak
You have a table named UserLogins with the following columns: (login_id (PK), user_id (FK), login_timestamp (TIMESTAMP)).
Your goal is to determine the longest consecutive daily login streak for each user. A "daily login" means a login occurred on a particular calendar day, and a streak is consecutive when the user logged in on each successive calendar day with no day skipped. If a user logs in multiple times on the same day, it counts as one day for the streak.
Return user_id and their longest_streak_days.
Click to see Solution
This is a classic and challenging "Gaps and Islands" problem that requires a combination of date functions, window functions (especially `ROW_NUMBER()`), and multiple CTEs to isolate distinct login days and then group consecutive days.
Step 1: Identify Unique Daily Logins for Each User
First, we need to ensure each day is counted only once, even if a user logs in multiple times on that day. We'll extract the date part of the `login_timestamp` and get distinct entries.
WITH DistinctDailyLogins AS (
SELECT
user_id,
CAST(login_timestamp AS DATE) AS login_day -- Extract just the date part (syntax varies by RDBMS)
-- For MySQL: DATE(login_timestamp)
-- For SQL Server: CAST(login_timestamp AS DATE)
FROM
UserLogins
GROUP BY
user_id, CAST(login_timestamp AS DATE) -- Group to get distinct login days
)
SELECT * FROM DistinctDailyLogins; -- Intermediate result check
Step 2: Calculate the Difference in Days from a "Base" Date (The Gaps-and-Islands Method)
This is the core trick: we assign a `ROW_NUMBER()` to each login day for a user, then subtract that row number (as a number of days) from the actual `login_day`. For consecutive days, both the date and the row number advance by one per row, so the difference stays constant; whenever a day is skipped, the difference jumps. The result (the "streak_group") therefore acts as a group identifier: rows sharing the same value belong to the same streak.
WITH DistinctDailyLogins AS (
-- ... (Same as Step 1)
),
StreakGroups AS (
SELECT
user_id,
login_day,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day) AS rn,
login_day - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day) * INTERVAL '1 DAY') AS streak_group
-- PostgreSQL-style date arithmetic. Consecutive days share the same streak_group:
--   login_day 2023-01-05 with rn = 5 -> 2022-12-31
--   login_day 2023-01-06 with rn = 6 -> 2022-12-31 (same group: consecutive)
--   login_day 2023-01-08 with rn = 7 -> 2023-01-01 (new group: 2023-01-07 was skipped)
-- For SQL Server: DATEADD(day, -1 * ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day), login_day)
-- For MySQL: DATE_SUB(login_day, INTERVAL ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day) DAY)
FROM
DistinctDailyLogins
)
SELECT * FROM StreakGroups; -- Intermediate result check
Step 3: Calculate the Length of Each Streak
Now that we have grouped consecutive login days, we can simply count the number of days within each `streak_group` for each user.
WITH DistinctDailyLogins AS (
-- ... (Same as Step 1)
),
StreakGroups AS (
SELECT
user_id,
login_day,
-- Using an expression that results in the same value for consecutive days, but changes when a gap occurs
-- Example (PostgreSQL-like syntax for date arithmetic):
login_day - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day) * INTERVAL '1 DAY') AS streak_identifier
-- For SQL Server: DATEADD(day, -1 * (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day)), login_day) AS streak_identifier
-- For MySQL: DATE_SUB(login_day, INTERVAL (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day)) DAY) AS streak_identifier
FROM
DistinctDailyLogins
),
StreakLengths AS (
SELECT
user_id,
streak_identifier,
COUNT(*) AS current_streak_length
FROM
StreakGroups
GROUP BY
user_id, streak_identifier
)
SELECT * FROM StreakLengths; -- Intermediate result check
Step 4: Find the Longest Streak for Each User
Finally, for each user, we find the maximum `current_streak_length` from all their identified streaks.
WITH DistinctDailyLogins AS (
SELECT
user_id,
CAST(login_timestamp AS DATE) AS login_day -- CAST(... AS DATE) extracts the date part and works in most RDBMS
FROM
UserLogins
GROUP BY
user_id, CAST(login_timestamp AS DATE)
),
StreakGroups AS (
SELECT
user_id,
login_day,
-- This calculation creates a common "group" for consecutive days.
-- If a day is consecutive, the difference between its `login_day` and its `row_number` will be constant.
login_day - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day) * INTERVAL '1 DAY') AS streak_identifier
-- PostgreSQL-style date arithmetic; see the SQL Server/MySQL equivalents noted in Step 3.
FROM
DistinctDailyLogins
),
StreakLengths AS (
SELECT
user_id,
streak_identifier,
COUNT(*) AS current_streak_length
FROM
StreakGroups
GROUP BY
user_id, streak_identifier
)
SELECT
user_id,
MAX(current_streak_length) AS longest_streak_days
FROM
StreakLengths
GROUP BY
user_id
ORDER BY
user_id;
Conclusion: This complex problem showcases the power of chained CTEs and window functions, particularly the "Gaps and Islands" method, to identify and measure consecutive sequences in time-series data. By creating a `streak_identifier` that remains constant for consecutive days and changes for non-consecutive days, we effectively group the streaks and then calculate their lengths.
