mirror of
https://github.com/doctrine/orm.git
synced 2026-03-23 22:42:18 +01:00
Pagination CountWalker should run a COUNT(*) query instead of COUNT(tbl.id) when HINT_DISTINCT is false #7396
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally created by @d-ph on GitHub (Jul 16, 2024).
Feature Request
Summary
I'd like to propose that the
Doctrine\ORM\Tools\Pagination\CountWalkershould create a count query that selectsCOUNT(*)instead ofCOUNT(tbl.id)when a query'sHINT_DISTINCTis set/declaredfalse. Both "counts" result in the same number being produced, howeverCOUNT(*)allows some databases (e.g. Postgres) to finish the query faster. Please see the following query plans.Notice how in the case of
COUNT(*), Postgres is counting using an "Index Only Scan". The speed difference isn't substantial (at least for the number of rows I tested with), but it's also quite cheap to obtain (by just making the code useCOUNT(*)).Thank you.
@greg0ire commented on GitHub (Jul 16, 2024):
Do you have an explanation as to why the query plan differs from one query to the other, and are you sure that explanation will apply regardless of the database in use? I'm not sure we should introduce more complexity, especially if it is not possible to measure a substantial speed difference. But I agree that "Seq Scan" does not sound good.
@d-ph commented on GitHub (Jul 17, 2024):
I've done some research just now and the short answer appears to be that
COUNT(*)gives more opportunities to the database to seek alternative ways to fulfil the query than whenCOUNT(tbl.id)is used.My other observations:
COUNT(column_name)tells the database to count not-null values of the column_name, whileCOUNT(*)tells the database to just count the rows, regardless of their contents. Postgres in particular is sometimes willing to use an index-only scan to fulfil theCOUNT(*)query. The apparent non-obvious thing here is thatCOUNT(*)is an equivalent ofCOUNT(), but it's notCOUNT()because that's a syntax error in SQL.COUNT(*)allows to use any index available on the table in question, whileCOUNT(tbl.id)allows to use only the (usually primary) index on theidcolumn. This for some reason is sometimes less eagerly used if there's been a period of time since the lastvacuum analyzewas carried out on the table in question, because when Ivacuum analyze'd my test table, then Postgres started using an index-only scan also for theCOUNT(tbl.id)query.COUNT(tbl.id).I'm not sure but it's very plausible. For what it's worth, chatgpt is under impression (and bullet-points some examples) that most of the databases benefit from being told to do
COUNT(*)instead ofCOUNT(column_name), when the column_name is not nullable (and primary keys aren't supposed to be nullable).After going down the rabbit hole here, I'm also less inclined now to make this change. The only standing argument for that change is what I said at the top:
COUNT(*)gives more opportunities for optimised querying thanCOUNT(tbl.id), however, as I later said, I wasn't able to confirm spectacular speed difference in my test data.For this reason I'm fine to close this ticket now without any code change, unless you'd like (me) to make that change anyway due to the reason I explained above.
@greg0ire commented on GitHub (Jul 17, 2024):
Thanks for the very detailed explanation. From what you are telling, I'm starting to think that to observe a significant performance difference, we would have to add a where clause involving columns other than the primary key (I'm leaving aside the case of composite primary keys for the sake of simplicity). Is that what you observe ?
@d-ph commented on GitHub (Jul 18, 2024):
Interesting. You're absolutely right. When I add a simple where-condition on a column that is indexed, the index-only scan is never used in the case of
COUNT(tbl.id)(regardless of whether the table isvacuum'd or not), but is always used whenCOUNT(*)is used. In this particular case, I can see 50-75ms speed difference in favour ofCOUNT(*). Whether that's significant is open to debate.The
CountWalkeralready throws an exception when the table being queried uses a composite primary key. So we don't need to discuss that case here.So yeah, still down to you if you wish to carry on with this code change. I'm somewhat more inclined to it now.
Edit: I just tried different where-conditions, and as one could imagine, if the where-condition filters out significant amount of the table, the speed difference is significant. E.g. 0.1s vs 0.001s.
@greg0ire commented on GitHub (Jul 18, 2024):
Great! Using Git, can you find out why we are not using
(*)in the first place? Likewise, if you switch the code to(*)< do any tests break?@d-ph commented on GitHub (Jul 19, 2024):
Git history shows that the code does what it does since the very beginning (2012) and it wasn't attempted to change it to conditionally doing
COUNT(*).I made that change and no tests failed. In particular,
CountWalkerTestdidn't fail because it never checks for the case when the HINT_DISTINCT is set to false.I checked this code change in my webapp and it's effective (i.e.
COUNT(*)is run conditionally), and I didn't experience any regressions.@greg0ire commented on GitHub (Jul 19, 2024):
Then please send a PR against the next minor branch, I think I'm OK with this change 👍
@d-ph commented on GitHub (Jul 19, 2024):
Could you tell me whether you mean the
3.3.xor the3.2.xbranch, please?It that's an option, I would backport it to
2.20.xand then "replay" the code change from2.20.xto every branch above it.@greg0ire commented on GitHub (Jul 19, 2024):
2.x is only open for bugfixes. This does not look like a bug at all. The next minor branch is 3.3.x (since the last tag is 3.2.1).