MySQL: Why is it that including all columns in group by is not allowing the table to aggregate the data? - Databases

TopAnswers Databases

Meta

Databases

TeX

Code Golf

APL

C++

.net

db<>fiddle

Java

*nix

PHP

PowerShell

Python

Rust

टेक्-मराठी

Typst

Web Client Dev

Web Server Dev

MySQL: Why is it that including all columns in group by is not allowing the table to aggregate the data?

mysql add tag

anoldmaninthesea

I'm using Viescas' *SQL Queries for Mere Mortals*, and its datasets.

If I run this code:

    select c.customerid, concat( c.custLastName, ", ", c.custFirstName) as CustFullName,
    e.EntStageName, ms.stylename, mp.preferenceseq, count(mp.StyleID)
    from customers as c
    inner join musical_preferences as mp
    on mp.CustomerID = c.CustomerID
    inner join musical_styles as ms 
    on ms.StyleID = mp.StyleID
    inner join entertainer_styles as es
    on es.StyleID = ms.StyleID
    inner join entertainers as e
    on e.EntertainerID = es.EntertainerID
    group by c.customerid, CustFullName, e.EntStageName, ms.stylename, mp.preferenceseq
    order by CustFullname;

I get the following:

    10006	Berg, Matt	          Jim Glynn	           Folk	    2	1
    10006	Berg, Matt	         Susan McLain	     Folk	    2	1
    10006	Berg, Matt	              Topazz        Variety 	1	1
    10006	Berg, Matt	        Modern Dance	    Variety	    1	1
    10011	Bonnicksen, Joyce	Katherine Ehrlich	Standards	1	1
    10011	Bonnicksen, Joyce	Carol Peacock Trio	Standards	1	1

Where the last column is the count function.

Why is it that this select is not agregating the data, and repeating the customersid several times, like for customer `Berg, Matt`?

I removed `e.EntStageName, ms.stylename, and mp.preferenceseq` of the group by, and then the select statement correctly grouped the data...

The book states that I should include all the columns in the group by. However, if I do that, on this specific example, I won't get a table with aggregated values. In other examples from the book, the examples with group by seems to run without this specific problem showing up.

Is there a way to include all the columns in the group by and obtain a table with the desired agregation?

P.S:I've also asked this on StackOverflow, but the comment sent me to fiddle and that to here... =D

Top Answer

Andriy M

When you add a `GROUP BY` clause to your query, it means the output will have exactly one row per each distinct combination of columns/expressions provided in the `GROUP BY` clause.  The underlying dataset may have more than one row matching any specific combination, but the output will produce only one of each.

This way a group of rows sharing the same values is condensed to a single row.  The idea behind that is to enable you to obtain various aggregate values for each group, like row count, for instance.  A `GROUP BY` query is necessarily an *aggregate query only*, it cannot give you both aggregate data and detail data (individual values of columns not included in the `GROUP BY`).

### Solution 1: Derived table + JOIN

If you want to combine aggregate results with details, the classical method is to use a separate aggregate query as a derived table and join it back to the query that provides details.  The aggregate query must expose a column or columns that can be used as a key for joining with the details query.

Often, coming up with a separate aggregate query in addition to the details query means you have to repeat parts of the logic.  In this case, as far as I can see, you would need to repeat some (though not all) of the joins.  I believe the following should be enough:

```
select
  mp.CustomerID,
  count(mp.StyleID) as RowCount
from
  musical_preferences as mp
  inner join musical_styles as ms on ms.StyleID = mp.StyleID
  inner join entertainer_styles as es on es.StyleID = ms.StyleID
group by
  mp.CustomerID
;
```

That should give you an output like this:

CustomerID|RowCount
-:|-:
10006|4
10011|2
...|...

Using that query as a derived table, you can join it back to your current query (where you will no longer need its own `GROUP BY` line):

```
select
  c.CustomerID,
  concat( c.custLastName, ", ", c.custFirstName) as CustFullName,
  e.EntStageName,
  ms.stylename,
  mp.preferenceseq,
  agg.RowCount  -- this replaces the `COUNT()`
from
  customers as c
  inner join musical_preferences as mp on mp.CustomerID = c.CustomerID
  inner join musical_styles as ms on ms.StyleID = mp.StyleID
  inner join entertainer_styles as es on es.StyleID = ms.StyleID
  inner join entertainers as e on e.EntertainerID = es.EntertainerID
  inner join
  (
    select
      mp.CustomerID,
      count(mp.StyleID) as RowCount
    from
      musical_preferences as mp
      inner join musical_styles as ms on ms.StyleID = mp.StyleID
      inner join entertainer_styles as es on es.StyleID = ms.StyleID
    group by
      mp.CustomerID
  ) as agg on agg.CustomerID = c.CustomerID
-- GROUP BY is gone
order by
  CustFullname
;
```

And the result should be something like this:

CustomerID|CustFullName|EntStageName|stylename|preferenceseq|RowCount
-:|-|-|-|-:|-:
10006|Berg, Matt|Jim Glynn   |Folk   |2|4
10006|Berg, Matt|Susan McLain|Folk   |2|4
10006|Berg, Matt|Topazz      |Variety|1|4
10006|Berg, Matt|Modern Dance|Variety|1|4
10011|Bonnicksen, Joyce|Katherine Ehrlich|Standards|1|2
10011|Bonnicksen, Joyce|Carol Peacock Trio|Standards|1|2
...|...|...|...|...|...

### Solution 2: Window aggregation

The more recent (also more efficient and certainly more convenient) method is to use window aggregation.  The whole idea of window aggregation is exactly what you are looking for: getting aggregate results while still being able to return details as well.  This has been achieved by moving the row grouping definition from the query to the aggregate function.  To use `COUNT()` as an example, the window version of it would have the following syntax:

```
COUNT(...) OVER (PARTITION BY ... /* list of grouping columns/expressions */)
```

The `PARTITION BY` subclause acts like `GROUP BY` but for the `COUNT()` function only.  Where rows in the output have the same values for columns or expressions specified in `PARTITION BY`, the function will produce the same values.  (Please note that there are actually more subclauses to window aggregate functions, but for the purposes of this answer only `PARTITION BY` is relevant.)

Your query, therefore, would look like this:

```
select
  c.CustomerID,
  concat( c.custLastName, ", ", c.custFirstName) as CustFullName,
  e.EntStageName,
  ms.stylename,
  mp.preferenceseq,
  count(mp.StyleID) over (partition by c.CustomerID) as RowCount
                                /* or: mp.CustomerID */
from
  customers as c
  inner join musical_preferences as mp on mp.CustomerID = c.CustomerID
  inner join musical_styles as ms on ms.StyleID = mp.StyleID
  inner join entertainer_styles as es on es.StyleID = ms.StyleID
  inner join entertainers as e on e.EntertainerID = es.EntertainerID
-- GROUP BY is not needed
order by
  CustFullname
;
```

Note that you do not need to specify both `CustomerID` (from whichever table) and the `CustFullName` expression.  The `CustomerID` column is enough because it uniquely identifies the groups that you want to obtain row counts of.

Given how much simpler this approach is than the classical method, it is easy to see that this is the preferred method for scenarios like yours.  Not only the query becomes simpler, but usually this means the query plan will be more efficient as well, because window aggregation is supposed to be applied to the same row set that you are returning.  It means there is no additional joining, no table derivation – therefore, no need to hit the underlying tables an extra time.

The only issue is that the window aggregate functions must be supported by the database product.  By now probably all the major RDBMS products have implemented window functions[^mysql8], so you just need to make sure you are using the latest version.

[^mysql8]: In particular, MySQL has added support for window functions in version 8.0.

1 Answer