[server]: remove orphan files and directories when tablet starts up #3388
gyang94 wants to merge 1 commit into
Pull request overview
Fixes orphan replica directory leaks on TabletServer restart during table/partition deletion (issue #3387). Adds two cleanup paths corresponding to the two scenarios in the issue: a startup-time empty parent/table-dir sweep in LogManager's SchemaNotExistException handler (table-deletion case) and a new sweepOrphanTabletDirs invoked from ReplicaManager.stopReplicas when a NoneReplica receives deleteLocal=true (partition-deletion case).
Changes:
- `ReplicaManager`: new `sweepOrphanTabletDirs` drops the orphan log via `LogManager`, removes the sibling KV tablet (via `KvManager.dropKv` or direct `FileUtils.deleteDirectory`), updates `LocalDiskManager` accounting, and registers the parent dir for empty-dir cleanup.
- `LogManager`: after deleting `log-N/` and `kv-N/` on `SchemaNotExistException`, also remove the now-empty parent dir, and for partitioned tables additionally remove the empty grandparent (table) dir.
- `StopReplicaITCase`: two new IT tests covering table-drop (startup cleanup) and partition-drop (orphan-sweep on stopReplica) while a TabletServer is offline.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| fluss-server/src/main/java/org/apache/fluss/server/replica/ReplicaManager.java | Handles NoneReplica + deleteLocal=true by sweeping orphan log/KV dirs loaded at startup. |
| fluss-server/src/main/java/org/apache/fluss/server/log/LogManager.java | Cleans up empty partition/table parent directories after residual tablet dirs are deleted. |
| fluss-server/src/test/java/org/apache/fluss/server/coordinator/StopReplicaITCase.java | New IT tests for the two orphan-cleanup paths during offline drop. |
    Tuple2<PhysicalTablePath, TableBucket> pathAndBucket =
            FlussPaths.parseTabletDir(tabletDir);
    try {

    logManager.dropLog(tb);

    boolean isKvTable = false;
    if (kvManager.getKv(tb).isPresent()) {
        kvManager.dropKv(tb);
        isKvTable = true;
    } else {
        File kvTabletDir = FlussPaths.kvTabletDir(dataDir, physicalTablePath, tb);
        if (kvTabletDir.exists()) {
            isKvTable = true;
            try {
                FileUtils.deleteDirectory(kvTabletDir);
            } catch (IOException e) {
                throw new KvStorageException(
                        String.format(
                                "Failed to delete orphan KV tablet directory %s", kvTabletDir),
                        e);
            }
        }
    }

    localDiskManager.recordReplicaDelete(dataDir, isKvTable);

    if (tb.getPartitionId() != null) {
        deletedPartitionIds.put(tb.getPartitionId(), tabletParentDir);
        deletedTableIds.put(tb.getTableId(), tabletParentDir.getParent());
    } else {
        deletedTableIds.put(tb.getTableId(), tabletParentDir);
    }
Force-pushed from 7757a66 to 1bf4582.
swuferhong left a comment
Hi @gyang94. Thanks for your great work. I left some comments, pls take a look.
The commit msg has a typo: "remote orphan files" — should be "remove orphan files".
    // database dir - must NOT remove. Safe under parallel
    // execution: the last job to finish finds the dir
    // empty and removes it; deleteDirectoryQuietly
    // tolerates races.

    if (parentDir != null && FileUtils.isDirectoryEmpty(parentDir)) {
        FileUtils.deleteDirectoryQuietly(parentDir);
There's a TOCTOU race: between the isDirectoryEmpty check and deleteDirectoryQuietly, another thread/loadLog task could create a file. Consider using Files.delete(path) which fails on non-empty dirs, making this truly safe under concurrency.
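A minimal sketch of the suggestion above, using only the JDK: `Files.delete` refuses to remove a non-empty directory, so the emptiness check and the removal become one step. The class and method names (`EmptyDirRemover`, `removeIfEmpty`) are hypothetical and not part of this PR. Note that the JDK documents `DirectoryNotEmptyException` as an optional specific exception, so some file systems may report a plain `IOException` instead.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical helper, not part of the PR.
final class EmptyDirRemover {

    /**
     * Removes dir only if it is empty at the moment of deletion.
     * Returns true only when this call removed the directory.
     */
    static boolean removeIfEmpty(Path dir) throws IOException {
        try {
            // No separate isDirectoryEmpty check: the delete itself
            // fails if anything is inside the directory.
            Files.delete(dir);
            return true;
        } catch (DirectoryNotEmptyException e) {
            // A concurrent task created a file here; leave the dir alone.
            return false;
        } catch (NoSuchFileException e) {
            // Another cleanup job already removed it.
            return false;
        }
    }
}
```

With this shape a concurrent writer can at worst cause the cleanup to be skipped; it cannot lose a freshly created file.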
    FileUtils.deleteDirectoryQuietly(parentDir);
    if (isPartitioned) {
        File tableDir = parentDir.getParentFile();
        if (tableDir != null && FileUtils.isDirectoryEmpty(tableDir)) {
For the table drop case, the startup cleanup via SchemaNotExistException works well.
However, for the partition drop case (table still exists, only partition dropped while TS was offline), the orphan files won't be auto-cleaned. When the TS restarts, loadLog() succeeds because the table schema still exists in ZK, so the log is loaded normally as a NoneReplica. The sweepOrphanTabletDirs mechanism requires a stopReplica(delete=true) to trigger, but the Coordinator won't re-send it — by the time the TS reconnects, the Coordinator has already marked the deletion as "successful" (after retry exhaustion) and removed all tracking metadata for that partition's replicas.
I think you need to consider adding a startup check for partition existence (similar to the schema check) to cover this gap.
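One way such a startup check could look, as a rough sketch only. Everything here is hypothetical: the `LocalTablet` holder, the `findOrphans` method, and the assumption that the server can obtain a map of live partition IDs per table from metadata at startup. It only shows the selection rule (table still exists, partition does not), not how Fluss would fetch the metadata or delete the directories.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not part of the PR.
final class PartitionStartupCheck {

    /** Hypothetical description of one tablet directory found on local disk. */
    static final class LocalTablet {
        final long tableId;
        final Long partitionId; // null for non-partitioned tables
        final String dir;

        LocalTablet(long tableId, Long partitionId, String dir) {
            this.tableId = tableId;
            this.partitionId = partitionId;
            this.dir = dir;
        }
    }

    /**
     * Returns tablets whose table still exists but whose partition is no
     * longer listed in the (assumed) metadata snapshot.
     */
    static List<LocalTablet> findOrphans(
            Collection<LocalTablet> localTablets,
            Map<Long, Set<Long>> livePartitionsByTable) {
        List<LocalTablet> orphans = new ArrayList<>();
        for (LocalTablet tablet : localTablets) {
            if (tablet.partitionId == null) {
                continue; // non-partitioned: nothing to check here
            }
            Set<Long> live = livePartitionsByTable.get(tablet.tableId);
            if (live == null) {
                continue; // table itself is gone: left to the schema check
            }
            if (!live.contains(tablet.partitionId)) {
                orphans.add(tablet);
            }
        }
        return orphans;
    }
}
```

Tablets of dropped tables are deliberately skipped so that this check does not overlap with the existing `SchemaNotExistException` path.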
Force-pushed from 1bf4582 to ef8f194.
Purpose
Linked issue: close #3387
Brief change log
Two changes, one in each cleanup path:
1. `ReplicaManager`: handle `NoneReplica` with `deleteLocal=true`
   When the `NoneReplica` branch receives `deleteLocal=true`, look up the bucket in `LogManager.currentLogs`. If present, the log tablet was loaded at startup but never registered in `allReplicas`, so it is an orphan. Drop the log via `logManager.dropLog()`, delete the sibling KV tablet directory (via `kvManager.dropKv()` if loaded, otherwise direct `FileUtils.deleteDirectory`), update disk usage accounting, and record the parent directory for empty-dir cleanup.
   This handles the partition deletion scenario.
2. `LogManager`: clean up empty parent directories in `SchemaNotExistException` handler
   After the existing handler deletes `log-N/` and `kv-N/`, check whether the parent directory is empty and delete it. For partitioned tables, also check the grandparent (table directory). For non-partitioned tables, the grandparent is the database directory and must NOT be deleted. This is safe under parallel `loadAllLogs` execution because `deleteDirectoryQuietly` tolerates races.
   This handles the table deletion scenario.
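The parent/grandparent rule in item 2 can be illustrated with a small stand-alone sketch. This is not the PR's code: it uses plain `java.io.File.delete()` (which returns `false` for a non-empty directory) as a stand-in for the `FileUtils` helpers, and the class name `EmptyParentCleanup` and the directory names in the usage note are made up for illustration.

```java
import java.io.File;

// Hypothetical sketch of the cleanup rule, not the PR's implementation.
final class EmptyParentCleanup {

    /**
     * tabletParentDir is the directory that held log-N/ and kv-N/:
     * the partition dir for partitioned tables, the table dir otherwise.
     */
    static void removeEmptyParents(File tabletParentDir, boolean isPartitioned) {
        // File.delete() only succeeds on an empty directory.
        if (tabletParentDir == null || !tabletParentDir.delete()) {
            return; // still has sibling tablets, or already gone
        }
        if (isPartitioned) {
            // Grandparent is the table dir; remove it if now empty.
            File tableDir = tabletParentDir.getParentFile();
            if (tableDir != null) {
                tableDir.delete();
            }
        }
        // Non-partitioned: the grandparent is the database dir, never touched.
    }
}
```

For example, with a made-up layout `db/orders-1/2024-p10`, calling `removeEmptyParents(partitionDir, true)` removes the partition directory and then `orders-1` only if no other partition directory remains, while `db` is left in place in every case.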
Tests
API and Format
Documentation