Use time-based retention for MSQ query result file cleanup #19074
cecemei merged 6 commits into apache:master from
TaskLogAutoCleanerConfig.
    DurableStorageUtils.QUERY_RESULTS_DIR.equals(nextDirName)
        && DurableStorageUtils.isQueryResultFileActive(
            currentFile,
            taskId -> Optional.fromNullable(taskStorage.getTaskInfo(taskId)).transform(TaskInfo::getCreatedTime),
This is going to be expensive; each usage of taskStorage is a call to the metadata store. It's better to batch these, such as by using one call to get all of the task IDs that have completed within the retention period. Then you can delete anything that doesn't appear in that list.
Btw, ideally that call to get all those task IDs should only look at type query_controller. It will generally be a much smaller and more manageable list than "all tasks".
Updated to fetch only tasks completed within the retention period. CompleteTaskLookup has no filter by type, though adding one wouldn't be much work. I'm assuming that even without this filter, not too many tasks will have finished within the last 6 hours.
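To illustrate the batching idea from this thread, here is a minimal sketch, not the PR's actual code. `BatchedCleanupSketch`, `selectForDeletion`, and the string task IDs are hypothetical names; the set of retained IDs is assumed to come from a single metadata-store query rather than one lookup per file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names are Druid APIs.
public class BatchedCleanupSketch {
    /**
     * Returns the task IDs whose result directories can be deleted: every ID
     * that is absent from the set fetched in one batched call.
     * Assumption: retainedTaskIds holds the tasks completed within the
     * retention period; a real cleaner would presumably also keep running tasks.
     */
    static List<String> selectForDeletion(List<String> resultDirTaskIds, Set<String> retainedTaskIds) {
        List<String> toDelete = new ArrayList<>();
        for (String taskId : resultDirTaskIds) {
            // In-memory membership test, so no per-file metadata-store call.
            if (!retainedTaskIds.contains(taskId)) {
                toDelete.add(taskId);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        Set<String> retained = Set.of("query-a", "query-b");
        List<String> dirs = List.of("query-a", "query-old", "query-b");
        System.out.println(selectForDeletion(dirs, retained));
    }
}
```

The cost is one metadata-store query per cleaner run plus a set lookup per file, instead of one query per file.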
gianm left a comment
Looks good to me but please consider the comments prior to merging. They could simplify things.
Description
Adds a configurable retention duration (default: 6 hours) to DurableStorageCleanerConfig and updates the cleaner to retain query result files based on task creation time rather than checking known task IDs.
Release Note
The durable storage cleaner now supports configurable time-based retention for MSQ query results. Previously, query results were retained for every task in the known-tasks list, which was unreliable for completed tasks. With this change, query results are retained for a configurable time period based on the task creation time.
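The retention rule described here can be sketched as a small predicate. This is an illustration under stated assumptions, not the PR's implementation: `RetentionCheckSketch` and `isWithinRetention` are hypothetical names, `java.time` and `java.util.Optional` stand in for whatever date and optional types the real code uses, and treating a task with no known creation time as expired is an assumption.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Hypothetical sketch of a time-based retention check.
public class RetentionCheckSketch {
    /**
     * Returns true if the query result should be kept, that is, if the task
     * was created less than durationToRetain before now.
     */
    static boolean isWithinRetention(Optional<Instant> createdTime, Instant now, Duration durationToRetain) {
        // Assumption: a task with no recorded creation time does not keep its results alive.
        return createdTime
            .map(created -> created.plus(durationToRetain).isAfter(now))
            .orElse(false);
    }
}
```

With the 6-hour default, a result whose task was created one hour ago would be kept, and one created seven hours ago would be eligible for cleanup.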
The new configuration property druid.msq.intermediate.storage.cleaner.durationToRetain controls the retention period for query results. The default retention period is 6 hours.
This PR has: