Grouping Data with GROUP BY
Remember aggregate functions from the last lesson? They gave us one summary for the entire dataset: "The average GPA is 3.6." But what if you want to break that down by category? For example, "What's the average GPA per major?" or "What are total sales per region?" That's where GROUP BY comes inβit's the bridge that connects aggregate functions to categorical analysis. This is one of the most powerful features in SQL!
π What You'll Learn
- What GROUP BY does and why it's essential for analysis
- How GROUP BY creates one row per group
- The partnership between GROUP BY and aggregate functions
- Grouping by a single column (e.g., by major, by region)
- Grouping by multiple columns (e.g., by major AND year)
- How GROUP BY changes the meaning of SELECT
- The golden rule: every non-aggregated column MUST appear in GROUP BY
- Using WHERE to filter before grouping
- Using HAVING to filter after grouping
- The complete execution order: WHERE β GROUP BY β HAVING
- Common beginner mistakes and how to avoid them
- Real-world business analytics scenarios
- Hands-on practice exercises
1. What Does GROUP BY Do?
GROUP BY divides your data into groups based on column values, then calculates aggregates for each group separately.
π‘ Real-World Example: Imagine you're a teacher analyzing test scores. Without GROUP BY, you'd say: "The class average is 85." With GROUP BY, you can break it down: "Period 1 averaged 88, Period 2 averaged 82, Period 3 averaged 87." Same data, but now organized by category (period). That's exactly what GROUP BY does!
Visual Transformation: From Individual Rows to Groups
WITHOUT GROUP BY (Aggregate for All)
Input: All Students
| name | major | gpa |
|---|---|---|
| Alice | CS | 3.8 |
| Bob | CS | 3.6 |
| Carol | Bio | 3.9 |
| Diana | Bio | 3.7 |
Output: One Row
SELECT AVG(gpa)
FROM students;
| AVG(gpa) |
|---|
| 3.75 |
WITH GROUP BY (Aggregate per Group)
Input: Same Students
| name | major | gpa |
|---|---|---|
| Alice | CS | 3.8 |
| Bob | CS | 3.6 |
| Carol | Bio | 3.9 |
| Diana | Bio | 3.7 |
Output: One Row per Group
SELECT major, AVG(gpa)
FROM students
GROUP BY major;
| major | AVG(gpa) |
|---|---|
| CS | 3.7 |
| Bio | 3.8 |
See the magic? GROUP BY split students into CS and Biology groups, then calculated the average for each group separately!
2. Basic GROUP BY Syntax
SELECT column_to_group_by, AGGREGATE_FUNCTION(column)
FROM table_name
GROUP BY column_to_group_by;
Example: Average GPA per Major
SELECT
major,
AVG(gpa) AS average_gpa
FROM students
GROUP BY major;
How It Works (Step by Step):
- Step 1: SQL looks at the
majorcolumn - Step 2: Creates separate groups for each unique major (CS, Biology, Math, etc.)
- Step 3: Calculates AVG(gpa) within each group
- Step 4: Returns one row per group (one row for CS, one for Biology, etc.)
3. Critical Concept: One Row Per Group
This is the most important thing to understand about GROUP BY:
β‘ The Golden Rule: GROUP BY produces exactly one row for each unique group. If you have 5 majors, you get 5 rows. If you have 50 regions, you get 50 rows.
Example: Counting Students per Major
SELECT
major,
COUNT(*) AS student_count
FROM students
GROUP BY major;
Sample Result:
| major | student_count |
|---|---|
| Computer Science | 45 |
| Biology | 32 |
| Mathematics | 28 |
| Engineering | 51 |
Four unique majors = four rows. Each row shows the major and how many students are in it.
4. The Most Important Rule: SELECT and GROUP BY Must Match
Here's the rule that trips up everyone at first:
π¨ Critical Rule: Every column in SELECT that is NOT inside an aggregate function MUST appear in GROUP BY.
Valid Queries (Follow the Rule):
-- β
CORRECT: major is in SELECT and GROUP BY
SELECT major, AVG(gpa)
FROM students
GROUP BY major;
-- β
CORRECT: Only aggregate in SELECT (no GROUP BY needed)
SELECT AVG(gpa)
FROM students;
-- β
CORRECT: Both columns in SELECT appear in GROUP BY
SELECT major, year, AVG(gpa)
FROM students
GROUP BY major, year;
Invalid Query (Breaks the Rule):
-- β WRONG: name is in SELECT but NOT in GROUP BY!
SELECT name, major, AVG(gpa)
FROM students
GROUP BY major;
Why This Fails (Logical Explanation):
The Problem: We're grouping by major, so we get one row per major (one for CS, one for Biology). But which student's name should appear in that row? CS has 45 studentsβshould it show Alice? Bob? All 45 names?
SQL's Response: "I don't know which name to pick!" β ERROR
The Solution: Either GROUP BY name (one row per student), or don't include name in SELECT.
5. Grouping by Multiple Columns
You can group by multiple columns to create more specific groups:
-- Average GPA per major AND year
SELECT
major,
year,
AVG(gpa) AS avg_gpa
FROM students
GROUP BY major, year;
How Multi-Column Grouping Works:
SQL creates groups based on unique combinations of the columns:
| major | year | avg_gpa |
|---|---|---|
| CS | 2023 | 3.65 |
| CS | 2024 | 3.72 |
| Biology | 2023 | 3.81 |
| Biology | 2024 | 3.76 |
Each combination gets its own row: CS 2023, CS 2024, Biology 2023, Biology 2024.
π‘ Pro Tip: The order of columns in GROUP BY doesn't change the resultβGROUP BY major, year is the same as GROUP BY year, major. But column order in SELECT does matter for display!
6. Filtering with WHERE (Before Grouping)
WHERE filters rows before they're grouped:
-- Average GPA per major, but only for students with GPA > 3.0
SELECT
major,
AVG(gpa) AS avg_gpa
FROM students
WHERE gpa > 3.0
GROUP BY major;
Execution Order:
- FROM: Get all students
- WHERE: Filter to students with GPA > 3.0
- GROUP BY: Group filtered students by major
- SELECT AVG(): Calculate average for each group
Students with GPA β€ 3.0 are excluded before grouping happens!
7. Filtering with HAVING (After Grouping)
What if you want to filter based on the aggregate result? That's where HAVING comes in:
-- Show only majors with average GPA above 3.5
SELECT
major,
AVG(gpa) AS avg_gpa
FROM students
GROUP BY major
HAVING AVG(gpa) > 3.5;
WHERE vs. HAVING
WHERE
- Filters individual rows
- Happens before grouping
- Can't use aggregates
- Example:
WHERE gpa > 3.0
HAVING
- Filters groups
- Happens after grouping
- Can use aggregates
- Example:
HAVING AVG(gpa) > 3.5
Example: Majors with More Than 30 Students
SELECT
major,
COUNT(*) AS student_count
FROM students
GROUP BY major
HAVING COUNT(*) > 30;
This filters out majors that have 30 or fewer students. The filtering happens after counting!
8. The Complete SQL Execution Order
Understanding the order SQL processes your query is crucial:
- FROM - Get the table
- WHERE - Filter individual rows
- GROUP BY - Create groups
- HAVING - Filter groups
- SELECT - Choose columns and calculate aggregates
- ORDER BY - Sort results
- LIMIT - Limit number of rows returned
Example Using All Clauses:
SELECT
major,
AVG(gpa) AS avg_gpa,
COUNT(*) AS student_count
FROM students
WHERE year >= 2023 -- 1. Filter: only recent students
GROUP BY major -- 2. Group by major
HAVING AVG(gpa) > 3.4 -- 3. Filter: only high-performing majors
ORDER BY avg_gpa DESC -- 4. Sort: highest GPA first
LIMIT 5; -- 5. Show: top 5 majors only
9. Common Beginner Mistakes
Mistake #1: Missing Column in GROUP BY
-- β WRONG: name is in SELECT but not in GROUP BY
SELECT name, major, AVG(gpa)
FROM students
GROUP BY major;
-- β
CORRECT: Remove name, or add it to GROUP BY
SELECT major, AVG(gpa)
FROM students
GROUP BY major;
Mistake #2: Using WHERE Instead of HAVING
-- β WRONG: Can't use aggregates in WHERE
SELECT major, AVG(gpa)
FROM students
WHERE AVG(gpa) > 3.5
GROUP BY major;
-- β
CORRECT: Use HAVING for aggregates
SELECT major, AVG(gpa)
FROM students
GROUP BY major
HAVING AVG(gpa) > 3.5;
Mistake #3: Forgetting GROUP BY Entirely
-- β WRONG: major + aggregate without GROUP BY
SELECT major, AVG(gpa)
FROM students;
-- β
CORRECT: Add GROUP BY
SELECT major, AVG(gpa)
FROM students
GROUP BY major;
Mistake #4: Wrong Column Order Expectation
-- Both of these work and give the same result:
SELECT major, AVG(gpa) FROM students GROUP BY major;
SELECT AVG(gpa), major FROM students GROUP BY major;
-- But the column display order is different!
-- First query: major column, then avg_gpa column
-- Second query: avg_gpa column, then major column
Mistake #5: Using HAVING Without GROUP BY
-- β WRONG: HAVING without GROUP BY
SELECT AVG(gpa)
FROM students
HAVING AVG(gpa) > 3.5;
-- β
CORRECT: Use WHERE for non-grouped queries
SELECT AVG(gpa)
FROM students
WHERE gpa > 3.5;
10. π― Hands-On Practice Exercises
Exercise 1: Basic GROUP BY
Task: Count how many students are in each major.
Show Answer
SELECT
major,
COUNT(*) AS student_count
FROM students
GROUP BY major;
Exercise 2: Average per Group
Task: Find the average GPA for each major.
Show Answer
SELECT
major,
AVG(gpa) AS average_gpa
FROM students
GROUP BY major;
Exercise 3: Multiple Aggregates
Task: For each major, show the count, average GPA, minimum GPA, and maximum GPA.
Show Answer
SELECT
major,
COUNT(*) AS student_count,
AVG(gpa) AS avg_gpa,
MIN(gpa) AS lowest_gpa,
MAX(gpa) AS highest_gpa
FROM students
GROUP BY major;
Exercise 4: GROUP BY with WHERE
Task: Find the average GPA per major, but only include students with GPA β₯ 3.0.
Show Answer
SELECT
major,
AVG(gpa) AS avg_gpa
FROM students
WHERE gpa >= 3.0
GROUP BY major;
Exercise 5: Using HAVING
Task: Show majors that have more than 40 students.
Show Answer
SELECT
major,
COUNT(*) AS student_count
FROM students
GROUP BY major
HAVING COUNT(*) > 40;
Exercise 6: Multi-Column Grouping
Task: Find the average GPA for each combination of major and year.
Show Answer
SELECT
major,
year,
AVG(gpa) AS avg_gpa
FROM students
GROUP BY major, year;
Challenge Exercise: Complete Query
Task: Find majors with average GPA above 3.5, but only include students from 2023 or later. Show the top 3 majors by average GPA.
Show Answer
SELECT
major,
AVG(gpa) AS avg_gpa,
COUNT(*) AS student_count
FROM students
WHERE year >= 2023
GROUP BY major
HAVING AVG(gpa) > 3.5
ORDER BY avg_gpa DESC
LIMIT 3;
Execution order: WHERE filters rows β GROUP BY creates groups β HAVING filters groups β ORDER BY sorts β LIMIT restricts output
π Key Takeaways
- GROUP BY divides data into groups and calculates aggregates per group
- Produces one row per unique group
- Golden Rule: Non-aggregated columns in SELECT must be in GROUP BY
- Can group by single or multiple columns
- WHERE filters rows before grouping
- HAVING filters groups after grouping
- Execution order: FROM β WHERE β GROUP BY β HAVING β SELECT β ORDER BY β LIMIT
- GROUP BY column order doesn't matter (but SELECT order does)
- Can combine multiple aggregates in one query
- Essential for business analytics, dashboards, and reports
π What's Next?
Congratulations! You've mastered one of SQL's most powerful features. GROUP BY unlocks advanced data analysisβyou can now answer complex business questions like "Which products sell best per region?" or "What's our revenue trend by month?" Next, you'll learn about table joins, which let you combine data from multiple tables. This is where SQL really becomes powerful!
Up next: Introduction to JOINs - Combining Data from Multiple Tables