Currently, this feature always triggers when an ingestion delay is detected at segment creation. It effectively doubles ingestion, which we may not always have the resources for. There is no mechanism to stop it if something goes wrong or if the cluster is overwhelmed, specifically when a single topic constantly triggers the feature or when all topics on a cluster trigger it at once. We need some circuit-breaking mechanisms.
Proposal
- Introduce a backfill pause property stream.kafka.consumer.prop.auto.offset.reset.pause. This can be set programmatically (through a controller API that will be added) or through an automatic process detailed below. This property will be checked before the backfill is triggered at segment creation.
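The pause check above could look like the following minimal sketch. The property name comes from this proposal; the class and method names are hypothetical, not actual Pinot APIs:

```java
import java.util.Map;

public class BackfillPauseCheck {
    // Property name from this proposal.
    static final String PAUSE_PROP = "stream.kafka.consumer.prop.auto.offset.reset.pause";

    // Returns true if backfill should be skipped because it has been paused,
    // either via the controller API or the automatic process described below.
    static boolean isBackfillPaused(Map<String, String> streamConfig) {
        return Boolean.parseBoolean(streamConfig.getOrDefault(PAUSE_PROP, "false"));
    }
}
```

Defaulting to "false" keeps the current behavior for tables that never set the property.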
- Add another property stream.kafka.consumer.prop.auto.offset.reset.maxSegmentsBeforeBackfillSkip. If the number of segments on the table is already high, we will not trigger backfill. When ingestion is permanently high rather than spiky, the ingestion doubling causes even more segments to be created, reaching the znode limit faster.
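A sketch of the segment-count guard, assuming the property name from this proposal; the class name and the source of the segment count are illustrative only:

```java
import java.util.Map;

public class SegmentCountGuard {
    // Property name from this proposal.
    static final String MAX_SEGMENTS_PROP =
        "stream.kafka.consumer.prop.auto.offset.reset.maxSegmentsBeforeBackfillSkip";

    // Skip backfill when the table already has too many segments, to avoid
    // the ingestion doubling pushing the cluster toward the znode limit.
    static boolean shouldSkipBackfill(int currentSegmentCount, Map<String, String> streamConfig) {
        String maxStr = streamConfig.get(MAX_SEGMENTS_PROP);
        if (maxStr == null) {
            return false; // no limit configured: keep current behavior
        }
        return currentSegmentCount >= Integer.parseInt(maxStr);
    }
}
```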
- Add a dsc property maxConcurrentBackfillsPerController. If a controller already has more than this number of backfills ongoing, we will not trigger any more. Consider the case where the cluster was down and restarted after a while: all topics will have ingestion lag, so backfill would trigger for every table. That is not necessary, because the cluster can recover on its own. If all topics backfill instead, we would create a backfill topic for each of them, and since topics cannot be cleanly removed, we would have to carry them in the stream config forever, which is not ideal.
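The per-controller cap could be enforced with a small concurrency limiter along these lines. This is a hypothetical sketch; the class and method names are illustrative, not existing Pinot code:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class BackfillConcurrencyLimiter {
    private final int maxConcurrentBackfills; // maxConcurrentBackfillsPerController
    private final AtomicInteger ongoing = new AtomicInteger();

    BackfillConcurrencyLimiter(int maxConcurrentBackfills) {
        this.maxConcurrentBackfills = maxConcurrentBackfills;
    }

    // Atomically reserve a backfill slot; returns false if this controller
    // is already running the maximum number of backfills.
    boolean tryStartBackfill() {
        while (true) {
            int current = ongoing.get();
            if (current >= maxConcurrentBackfills) {
                return false;
            }
            if (ongoing.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    // Release the slot when a backfill completes.
    void finishBackfill() {
        ongoing.decrementAndGet();
    }
}
```

The compare-and-set loop avoids over-admitting backfills when several segment-creation paths race on the same controller.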
- If a backfill is ongoing, we will not trigger another backfill until the current one is complete. In this case we will also pause backfills using the property in point 1 and emit a metric. This feature was built for occasional spikes; if backfills are constantly triggering, there is more throughput than expected and we should instead increase the number of partitions on the main topic.
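The last guard could tie into the pause property like the following sketch. The pause property name comes from this proposal; everything else (class name, metric counter) is a hypothetical stand-in:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PerTableBackfillGuard {
    static final String PAUSE_PROP = "stream.kafka.consumer.prop.auto.offset.reset.pause";
    private final Set<String> tablesWithOngoingBackfill = ConcurrentHashMap.newKeySet();
    private int repeatedTriggerCount = 0; // stand-in for a real metrics emitter

    // Returns true if a new backfill may start for this table. If one is already
    // ongoing, auto-pause future backfills (point 1) and record a metric instead.
    boolean tryTriggerBackfill(String tableName, Map<String, String> streamConfig) {
        if (!tablesWithOngoingBackfill.add(tableName)) {
            streamConfig.put(PAUSE_PROP, "true"); // auto-pause via the point 1 property
            repeatedTriggerCount++;               // metric: repeated trigger while ongoing
            return false;
        }
        return true;
    }

    void onBackfillComplete(String tableName) {
        tablesWithOngoingBackfill.remove(tableName);
    }

    int getRepeatedTriggerCount() {
        return repeatedTriggerCount;
    }
}
```

Pausing on a repeated trigger, rather than just skipping, surfaces the misconfiguration: an operator must explicitly unpause after adding partitions to the main topic.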