Pagination CountWalker should run a COUNT(*) query instead of COUNT(tbl.id) when HINT_DISTINCT is false #7396

New Issue

admin · 2026-01-22T15:51:12+01:00

admin commented

2026-01-22 15:51:12 +01:00

Originally created by @d-ph on GitHub (Jul 16, 2024).

Feature Request

Q	A
New Feature	no
RFC	no
BC Break	no

Summary

I'd like to propose that the Doctrine\ORM\Tools\Pagination\CountWalker should create a count query that selects COUNT(*) instead of COUNT(tbl.id) when a query's HINT_DISTINCT is set/declared false. Both "counts" result in the same number being produced, however COUNT(*) allows some databases (e.g. Postgres) to finish the query faster. Please see the following query plans.

// COUNT(tbl.id) case
EXPLAIN SELECT count(o0_.id) AS sclr_0 FROM "order" o0_;

QUERY PLAN
Finalize Aggregate  (cost=40598.68..40598.69 rows=1 width=8)
  ->  Gather  (cost=40598.46..40598.67 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=39598.46..39598.47 rows=1 width=8)
              ->  Parallel Seq Scan on ""order"" o0_  (cost=0.00..38594.17 rows=redacted width=4)

---

// COUNT(*) case
EXPLAIN SELECT count(*) AS sclr_0 FROM "order" o0_;

QUERY PLAN
Finalize Aggregate  (cost=36422.71..36422.72 rows=1 width=8)
  ->  Gather  (cost=36422.49..36422.70 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=35422.49..35422.50 rows=1 width=8)
              ->  Parallel Index Only Scan using idx_f5299398f543e763 on ""order"" o0_  (cost=0.42..34418.20 rows=redacted width=0)

Notice how in the case of COUNT(*), Postgres is counting using an "Index Only Scan". The speed difference isn't substantial (at least for the number of rows I tested with), but it's also quite cheap to obtain (by just making the code use COUNT(*)).

Thank you.

Originally created by @d-ph on GitHub (Jul 16, 2024). ### Feature Request  | Q | A |------------ | ------ | New Feature | no | RFC | no | BC Break | no #### Summary I'd like to propose that the `Doctrine\ORM\Tools\Pagination\CountWalker` should create a count query that selects `COUNT(*)` instead of `COUNT(tbl.id)` when a query's `HINT_DISTINCT` is set/declared `false`. Both "counts" result in the same number being produced, however `COUNT(*)` allows some databases (e.g. Postgres) to finish the query faster. Please see the following query plans. ``` // COUNT(tbl.id) case EXPLAIN SELECT count(o0_.id) AS sclr_0 FROM "order" o0_; QUERY PLAN Finalize Aggregate (cost=40598.68..40598.69 rows=1 width=8) -> Gather (cost=40598.46..40598.67 rows=2 width=8) Workers Planned: 2 -> Partial Aggregate (cost=39598.46..39598.47 rows=1 width=8) -> Parallel Seq Scan on ""order"" o0_ (cost=0.00..38594.17 rows=redacted width=4) --- // COUNT(*) case EXPLAIN SELECT count(*) AS sclr_0 FROM "order" o0_; QUERY PLAN Finalize Aggregate (cost=36422.71..36422.72 rows=1 width=8) -> Gather (cost=36422.49..36422.70 rows=2 width=8) Workers Planned: 2 -> Partial Aggregate (cost=35422.49..35422.50 rows=1 width=8) -> Parallel Index Only Scan using idx_f5299398f543e763 on ""order"" o0_ (cost=0.42..34418.20 rows=redacted width=0) ``` Notice how in the case of `COUNT(*)`, Postgres is counting using an "Index Only Scan". The speed difference isn't substantial (at least for the number of rows I tested with), but it's also quite cheap to obtain (by just making the code use `COUNT(*)`). Thank you.

admin commented

2026-01-22 15:51:13 +01:00

@greg0ire commented on GitHub (Jul 16, 2024):

Do you have an explanation as to why the query plan differs from one query to the other, and are you sure that explanation will apply regardless of the database in use? I'm not sure we should introduce more complexity, especially if it is not possible to measure a substantial speed difference. But I agree that "Seq Scan" does not sound good.

@greg0ire commented on GitHub (Jul 16, 2024): Do you have an explanation as to why the query plan differs from one query to the other, and are you sure that explanation will apply regardless of the database in use? I'm not sure we should introduce more complexity, especially if it is not possible to measure a substantial speed difference. But I agree that "Seq Scan" does not sound good.

admin commented

2026-01-22 15:51:14 +01:00

@d-ph commented on GitHub (Jul 17, 2024):

Do you have an explanation as to why the query plan differs from one query to the other

I've done some research just now and the short answer appears to be that COUNT(*) gives more opportunities to the database to seek alternative ways to fulfil the query than when COUNT(tbl.id) is used.

My other observations:

A lot of sources state that COUNT(column_name) tells the database to count not-null values of the column_name, while COUNT(*) tells the database to just count the rows, regardless of their contents. Postgres in particular is sometimes willing to use an index-only scan to fulfil the COUNT(*) query. The apparent non-obvious thing here is that COUNT(*) is an equivalent of COUNT(), but it's not COUNT() because that's a syntax error in SQL.
Despite the above, I found that both COUNT() queries allow Postgres to attempt to use an index-only scan, however COUNT(*) allows to use any index available on the table in question, while COUNT(tbl.id) allows to use only the (usually primary) index on the id column. This for some reason is sometimes less eagerly used if there's been a period of time since the last vacuum analyze was carried out on the table in question, because when I vacuum analyze'd my test table, then Postgres started using an index-only scan also for the COUNT(tbl.id) query.
The time difference between a "parallel seq scan" and a "parallel index-only scan" seems to be under 20ms for my test tables (in favour of the index-only scan). It makes me believe that Postgres is just accurately concluding that there will be no difference between the two, and just chooses to sometimes use a sequential scan in case of the COUNT(tbl.id).

are you sure that explanation will apply regardless of the database in use?

I'm not sure but it's very plausible. For what it's worth, chatgpt is under impression (and bullet-points some examples) that most of the databases benefit from being told to do COUNT(*) instead of COUNT(column_name), when the column_name is not nullable (and primary keys aren't supposed to be nullable).

I'm not sure we should introduce more complexity, especially if it is not possible to measure a substantial speed difference.

After going down the rabbit hole here, I'm also less inclined now to make this change. The only standing argument for that change is what I said at the top: COUNT(*) gives more opportunities for optimised querying than COUNT(tbl.id), however, as I later said, I wasn't able to confirm spectacular speed difference in my test data.

For this reason I'm fine to close this ticket now without any code change, unless you'd like (me) to make that change anyway due to the reason I explained above.

@d-ph commented on GitHub (Jul 17, 2024): > Do you have an explanation as to why the query plan differs from one query to the other I've done some research just now and the short answer appears to be that `COUNT(*)` gives more opportunities to the database to seek alternative ways to fulfil the query than when `COUNT(tbl.id)` is used. My other observations: 1. A lot of sources state that `COUNT(column_name)` tells the database to count not-null values of the column_name, while `COUNT(*)` tells the database to just count the rows, regardless of their contents. Postgres in particular is sometimes willing to use an index-only scan to fulfil the `COUNT(*)` query. The apparent non-obvious thing here is that `COUNT(*)` is an equivalent of `COUNT()`, but it's not `COUNT()` because that's a syntax error in SQL. 2. Despite the above, I found that both COUNT() queries allow Postgres to attempt to use an index-only scan, however `COUNT(*)` allows to use any index available on the table in question, while `COUNT(tbl.id)` allows to use only the (usually primary) index on the `id` column. This for some reason is sometimes less eagerly used if there's been a period of time since the last `vacuum analyze` was carried out on the table in question, because when I `vacuum analyze`'d my test table, then Postgres started using an index-only scan also for the `COUNT(tbl.id)` query. 3. The time difference between a "parallel seq scan" and a "parallel index-only scan" seems to be under 20ms for my test tables (in favour of the index-only scan). It makes me believe that Postgres is just accurately concluding that there will be no difference between the two, and just chooses to sometimes use a sequential scan in case of the `COUNT(tbl.id)`. > are you sure that explanation will apply regardless of the database in use? I'm not sure but it's very plausible. For what it's worth, chatgpt is under impression (and bullet-points some examples) that most of the databases benefit from being told to do `COUNT(*)` instead of `COUNT(column_name)`, when the column_name is not nullable (and primary keys aren't supposed to be nullable). > I'm not sure we should introduce more complexity, especially if it is not possible to measure a substantial speed difference. After going down the rabbit hole here, I'm also less inclined now to make this change. The only standing argument for that change is what I said at the top: `COUNT(*)` gives more opportunities for optimised querying than `COUNT(tbl.id)`, however, as I later said, I wasn't able to confirm spectacular speed difference in my test data. For this reason I'm fine to close this ticket now without any code change, unless you'd like (me) to make that change anyway due to the reason I explained above.

admin commented

2026-01-22 15:51:14 +01:00

@greg0ire commented on GitHub (Jul 17, 2024):

Thanks for the very detailed explanation. From what you are telling, I'm starting to think that to observe a significant performance difference, we would have to add a where clause involving columns other than the primary key (I'm leaving aside the case of composite primary keys for the sake of simplicity). Is that what you observe ?

@greg0ire commented on GitHub (Jul 17, 2024): Thanks for the very detailed explanation. From what you are telling, I'm starting to think that to observe a significant performance difference, we would have to add a where clause involving columns other than the primary key (I'm leaving aside the case of composite primary keys for the sake of simplicity). Is that what you observe ?

admin commented

2026-01-22 15:51:14 +01:00

@d-ph commented on GitHub (Jul 18, 2024):

I'm starting to think that to observe a significant performance difference, we would have to add a where clause involving columns other than the primary key (...) Is that what you observe?

Interesting. You're absolutely right. When I add a simple where-condition on a column that is indexed, the index-only scan is never used in the case of COUNT(tbl.id) (regardless of whether the table is vacuum'd or not), but is always used when COUNT(*) is used. In this particular case, I can see 50-75ms speed difference in favour of COUNT(*). Whether that's significant is open to debate.

I'm leaving aside the case of composite primary keys for the sake of simplicity

The CountWalker already throws an exception when the table being queried uses a composite primary key. So we don't need to discuss that case here.

So yeah, still down to you if you wish to carry on with this code change. I'm somewhat more inclined to it now.

Edit: I just tried different where-conditions, and as one could imagine, if the where-condition filters out significant amount of the table, the speed difference is significant. E.g. 0.1s vs 0.001s.

@d-ph commented on GitHub (Jul 18, 2024): > I'm starting to think that to observe a significant performance difference, we would have to add a where clause involving columns other than the primary key (...) Is that what you observe? Interesting. You're absolutely right. When I add a simple where-condition on a column that is indexed, the index-only scan is never used in the case of `COUNT(tbl.id)` (regardless of whether the table is `vacuum`'d or not), but is always used when `COUNT(*)` is used. In this particular case, I can see 50-75ms speed difference in favour of `COUNT(*)`. Whether that's significant is open to debate. > I'm leaving aside the case of composite primary keys for the sake of simplicity The `CountWalker` already throws an exception when the table being queried uses a composite primary key. So we don't need to discuss that case here. --- So yeah, still down to you if you wish to carry on with this code change. I'm somewhat more inclined to it now. --- Edit: I just tried different where-conditions, and as one could imagine, if the where-condition filters out significant amount of the table, the speed difference _is_ significant. E.g. 0.1s vs 0.001s.

admin commented

2026-01-22 15:51:15 +01:00

@greg0ire commented on GitHub (Jul 18, 2024):

Great! Using Git, can you find out why we are not using (*) in the first place? Likewise, if you switch the code to (*)< do any tests break?

@greg0ire commented on GitHub (Jul 18, 2024): Great! Using Git, can you find out why we are not using `(*)` in the first place? Likewise, if you switch the code to `(*)`< do any tests break?

admin commented

2026-01-22 15:51:15 +01:00

@d-ph commented on GitHub (Jul 19, 2024):

why we are not using (*) in the first place?

Git history shows that the code does what it does since the very beginning (2012) and it wasn't attempted to change it to conditionally doing COUNT(*).

if you switch the code to (*)< do any tests break?

I made that change and no tests failed. In particular, CountWalkerTest didn't fail because it never checks for the case when the HINT_DISTINCT is set to false.

I checked this code change in my webapp and it's effective (i.e. COUNT(*) is run conditionally), and I didn't experience any regressions.

@d-ph commented on GitHub (Jul 19, 2024): > why we are not using (*) in the first place? Git history shows that the code does what it does since the very beginning (2012) and it wasn't attempted to change it to conditionally doing `COUNT(*)`. > if you switch the code to (*)< do any tests break? I made that change and no tests failed. In particular, `CountWalkerTest` didn't fail because it never checks for the case when the HINT_DISTINCT is set to false. I checked this code change in my webapp and it's effective (i.e. `COUNT(*)` is run conditionally), and I didn't experience any regressions.

admin commented

2026-01-22 15:51:15 +01:00

@greg0ire commented on GitHub (Jul 19, 2024):

Then please send a PR against the next minor branch, I think I'm OK with this change 👍

@greg0ire commented on GitHub (Jul 19, 2024): Then please send a PR against the next minor branch, I think I'm OK with this change 👍

admin commented

2026-01-22 15:51:16 +01:00

@d-ph commented on GitHub (Jul 19, 2024):

Could you tell me whether you mean the 3.3.x or the 3.2.x branch, please?

It that's an option, I would backport it to 2.20.x and then "replay" the code change from 2.20.x to every branch above it.

@d-ph commented on GitHub (Jul 19, 2024): <img width="357" alt="image" src="https://github.com/user-attachments/assets/c761a30b-8c50-414b-9b5d-cd7a7655298c"> Could you tell me whether you mean the `3.3.x` or the `3.2.x` branch, please? It that's an option, I would backport it to `2.20.x` and then "replay" the code change from `2.20.x` to every branch above it.

admin commented

2026-01-22 15:51:16 +01:00

@greg0ire commented on GitHub (Jul 19, 2024):

2.x is only open for bugfixes. This does not look like a bug at all. The next minor branch is 3.3.x (since the last tag is 3.2.1).

@greg0ire commented on GitHub (Jul 19, 2024): 2.x is only open for bugfixes. This does not look like a bug at all. The next minor branch is 3.3.x (since the last tag is 3.2.1).

Sign in to join this conversation.

Branches Tags

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: doctrine/archived-orm#7396