Interest in bulk / batch insert operations when not using post-insert identifiers? #7517

Open
opened 2026-01-22 15:52:49 +01:00 by admin · 5 comments
Owner

Originally created by @Ocramius on GitHub (Jun 11, 2025).

Context

I'm currently working on a codebase where I identified a repeated INSERT operation being performed.

As you may know, I've been a long-time advocate of using UUIDs (specifically UUID5 / UUID6) for synthetic entity identifiers.

Due to recent advancements in profiling technology (Sentry, Tideways, etc.), I've managed to find a typical N+1 query:

```sql
INSERT INTO foo (bar, baz) VALUES (?, ?);
INSERT INTO foo (bar, baz) VALUES (?, ?);
-- ...
-- repeats until entities of the same type are gone
```

What can be done?

When `AUTO_INCREMENT`, `SERIAL`, or post-insert identifiers in general aren't necessary, we can perform bulk / batch `INSERT` operations: this reduces the number of round-trips to the DB, whose cost is mostly network time.
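For example, when identifiers are assigned in the application (e.g. UUIDs), the repeated statements above can collapse into a single round-trip with a multi-row `INSERT` (supported by MySQL, PostgreSQL, SQLite, and others):

```sql
-- one statement, one round-trip, N rows
INSERT INTO foo (bar, baz) VALUES
    (?, ?),
    (?, ?),
    (?, ?);
```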

Current behavior

I've traced the operation as follows:

  1. `UnitOfWork` computes insert order: https://github.com/doctrine/orm/blob/05c8c5f114fe9c8540e105f5f1c4f592253bf116/src/UnitOfWork.php#L1038-L1040
  2. `computeInsertExecutionOrder` produces a sorted list of `INSERT` operations: https://github.com/doctrine/orm/blob/05c8c5f114fe9c8540e105f5f1c4f592253bf116/src/UnitOfWork.php#L1191-L1246
  3. `EntityPersister#addInsert()` is called in a loop: https://github.com/doctrine/orm/blob/05c8c5f114fe9c8540e105f5f1c4f592253bf116/src/UnitOfWork.php#L1038-L1048
  4. in the same loop, `EntityPersister#executeInserts()` is called: https://github.com/doctrine/orm/blob/05c8c5f114fe9c8540e105f5f1c4f592253bf116/src/UnitOfWork.php#L1048-L1052
  5. `EntityPersister#executeInserts()` creates a prepared statement and fires it at the DB: https://github.com/doctrine/orm/blob/05c8c5f114fe9c8540e105f5f1c4f592253bf116/src/Persisters/Entity/BasicEntityPersister.php#L237-L285

The plan

The idea is to either change UnitOfWork#executeInserts() or UnitOfWork#computeInsertExecutionOrder() to produce a multidimensional array of micro-batches, where batches are only possible:

  • for the same entity type
  • for entities with an already assigned identifier
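The grouping described above could be sketched roughly as follows (a hypothetical helper, not the actual ORM API; `computeInsertBatches` and `$hasAssignedId` are illustrative names). Only consecutive entities of the same class with pre-assigned identifiers are merged, so the computed insert order is preserved:

```php
<?php
// Hypothetical sketch: group an ordered list of scheduled inserts into
// micro-batches. Entities join the current batch only when they share the
// same class AND already have an identifier assigned (i.e. no post-insert
// generator). Order across batches stays exactly as computed.
function computeInsertBatches(array $orderedInserts, callable $hasAssignedId): array
{
    $batches = [];

    foreach ($orderedInserts as $entity) {
        $class = get_class($entity);
        $last  = array_key_last($batches);

        $canMerge = $last !== null
            && $batches[$last]['class'] === $class
            && $batches[$last]['batchable']
            && $hasAssignedId($entity);

        if ($canMerge) {
            $batches[$last]['entities'][] = $entity;
        } else {
            $batches[] = [
                'class'     => $class,
                'batchable' => $hasAssignedId($entity),
                'entities'  => [$entity],
            ];
        }
    }

    return $batches;
}
```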

I would then hoist the `EntityPersister#executeInserts()` call out of the loop (where applicable), so that it can internally reuse the same prepared statement.
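To illustrate the prepared-statement reuse in isolation (plain PDO with in-memory SQLite standing in for the ORM's connection; not the persister code itself):

```php
<?php
// Prepare once, execute many times: the statement is parsed and planned
// a single time, and only parameter sets travel on subsequent executes.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE foo (bar TEXT, baz TEXT)');

$rows = [['a', 'b'], ['c', 'd'], ['e', 'f']];

$stmt = $pdo->prepare('INSERT INTO foo (bar, baz) VALUES (?, ?)');
foreach ($rows as $row) {
    $stmt->execute($row); // same prepared statement on every iteration
}

echo $pdo->query('SELECT COUNT(*) FROM foo')->fetchColumn(); // 3
```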

I can then also optimize the EntityPersister#executeInserts() logic to generate SQL for a full batch (single statement).
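Generating the batched SQL itself is mechanical; a minimal sketch (hypothetical helper name, positional placeholders only, no quoting concerns):

```php
<?php
// Hypothetical helper: build a single multi-row INSERT with one
// placeholder tuple per row, e.g. "... VALUES (?, ?), (?, ?)".
function buildBatchInsertSql(string $table, array $columns, int $rowCount): string
{
    $row = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';

    return sprintf(
        'INSERT INTO %s (%s) VALUES %s',
        $table,
        implode(', ', $columns),
        implode(', ', array_fill(0, $rowCount, $row))
    );
}

echo buildBatchInsertSql('foo', ['bar', 'baz'], 3);
// INSERT INTO foo (bar, baz) VALUES (?, ?), (?, ?), (?, ?)
```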

The plan does NOT involve any DBAL API addition/change.

The question

Is there interest in this optimization?
Specifically, can I work on it, or is the plan flawed? At this stage I'm mostly probing for interest.

Past issues

  • https://github.com/doctrine/orm/issues/2626
  • https://github.com/doctrine/dbal/pull/682

@beberlei commented on GitHub (Jun 11, 2025):

@Ocramius just re-read this and at first thought you meant "for post-insert id generators", but you meant exclusively not for them.

So yes, I am very much in favor of this. See https://www.baeldung.com/jpa-hibernate-batch-insert-update for how Hibernate does this and allows configuring it. A global batch size variable that defaults to 1; we could start there.
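The batch-size idea sketched in one line (illustrative only; `chunkInserts` and `$batchSize` are hypothetical names, not an existing ORM setting):

```php
<?php
// Split pending inserts of one entity type into chunks of at most
// $batchSize rows. A batch size of 1 degenerates to one INSERT per row,
// i.e. the current behavior.
function chunkInserts(array $pendingInserts, int $batchSize): array
{
    return array_chunk($pendingInserts, max(1, $batchSize));
}
```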


@Ocramius commented on GitHub (Jun 11, 2025):

I'd be fine with just reusing the same prepared statement, then making a second patch with the batch size 👍


@Ocramius commented on GitHub (Jun 11, 2025):

@beberlei a first implementation (without touching the persisters) is at #11978


@christian-kolb commented on GitHub (Jun 12, 2025):

This would be amazing for replay of projections where thousands of events are processed in one batch and therefore thousands of entities might be inserted in one batch. Network time is most of the execution time in one of our projects partially due to this reason!


@Ocramius commented on GitHub (Jun 13, 2025):

#11979 implements the configurable batch size for the `BasicEntityPersister`, and only where the identifier is known upfront.


Reference: doctrine/archived-orm#7517