[lake/hudi] Introduce Hudi LakeCatalog to create table by fhan688 · Pull Request #3395 · apache/fluss

fhan688 · 2026-05-28T10:39:18Z

Purpose

Linked issue: #3275

This PR introduces the Hudi LakeCatalog implementation, enabling Fluss to create tables in Hudi data lake storage. This aligns with the existing Paimon and Iceberg lake catalog support, completing the trio of supported lake formats for table creation.

Brief change log

New modules & classes (`fluss-lake/fluss-lake-hudi`):

HudiLakeCatalog: Implements LakeCatalog interface, supporting both HMS (Hive Metastore) and DFS (filesystem) catalog modes. Handles table creation with schema compatibility check for crash-recovery idempotency, automatic database creation, and system column (__bucket, __offset, __timestamp) appending.
FlussDataTypeToHudiDataType: Implements DataTypeVisitor to convert Fluss data types to Flink types (Hudi's type system). Handles LocalZonedTimestampType specially: maps to BIGINT under HMS mode, TIMESTAMP_WITH_LOCAL_TIME_ZONE under DFS mode.
HudiConversions: Core conversion utility. Converts Fluss TablePath → Hudi ObjectPath, TableDescriptor → ResolvedSchema / Hudi table properties. Validates HUDI_UNSETTABLE_OPTIONS (6 protected options that Fluss auto-manages), checks system column name conflicts, and handles property prefix rewriting (hudi.xxx → xxx, others → fluss.xxx).
HudiCatalogUtils: Factory for creating Hudi Catalog instances (HoodieHiveCatalog for HMS, HoodieCatalog for DFS). Uses copied Configuration to avoid mutating the original.
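The property prefix rewriting listed under HudiConversions can be sketched roughly as follows. This is a minimal illustration rather than the PR's code: the class and method names (`PrefixRewriteSketch`, `rewrite`) are hypothetical, and it assumes the rule is exactly "strip a leading `hudi.`, otherwise prepend `fluss.`".

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the prefix rule; not the PR's HudiConversions code. */
final class PrefixRewriteSketch {
    private static final String HUDI_PREFIX = "hudi.";
    private static final String FLUSS_PREFIX = "fluss.";

    /** Strips "hudi." from Hudi options and prefixes every other key with "fluss.". */
    static Map<String, String> rewrite(Map<String, String> props) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            String key = e.getKey();
            if (key.startsWith(HUDI_PREFIX)) {
                // hudi.xxx -> xxx
                out.put(key.substring(HUDI_PREFIX.length()), e.getValue());
            } else {
                // anything else -> fluss.xxx
                out.put(FLUSS_PREFIX + key, e.getValue());
            }
        }
        return out;
    }
}
```

Under these assumptions, a key such as `hudi.hoodie.datasource.write.recordkey.field` would reach Hudi as `hoodie.datasource.write.recordkey.field`, while a Fluss-side key would be stored with a `fluss.` prefix.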

Modifications to existing modules (`fluss-flink/fluss-flink-common`):

LakeFlinkCatalog: Adds HUDI branch in getLakeCatalog() with a new HudiCatalogFactory inner class that uses reflection to instantiate Hudi catalog (mirroring the Iceberg pattern to avoid compile-time dependency on hudi-flink-bundle).
LakeTableFactory: Adds HUDI branch in getLakeTableFactory() with getHudiFactory() that reflectively loads HoodieTableFactory.
HudiLakeStorage: Replaces the UnsupportedOperationException in createLakeCatalog() with new HudiLakeCatalog(hudiConfig) to wire the SPI path.

Key design decisions:

Aspect	Decision
Table type mapping	PK table → MERGE_ON_READ, Log table → COPY_ON_WRITE
Index strategy	BUCKET index type, aligned with Fluss's bucketing model
Dependency isolation	Hudi bundle loaded via reflection/plugin classloader (no compile-time dependency in fluss-flink-common)
Catalog mode	Supports hms (Hive Metastore) and dfs (filesystem)
Property rewriting	hudi. prefix stripped; non-hudi properties prefixed with fluss.
Idempotency	Schema-compatible duplicate creation is treated as success for crash recovery
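As a rough sketch of the table type and index decisions in the table above: the snippet below derives default options from whether the Fluss table has a primary key. The option keys `table.type` and `index.type` are assumptions based on Hudi's Flink options, and `TableTypeSketch` is a hypothetical name; the keys the PR actually writes are not shown on this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the PK/log table mapping; option keys are assumptions. */
final class TableTypeSketch {
    /** PK table -> MERGE_ON_READ, log table -> COPY_ON_WRITE; BUCKET index in both cases. */
    static Map<String, String> defaults(boolean isPkTable) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("table.type", isPkTable ? "MERGE_ON_READ" : "COPY_ON_WRITE");
        options.put("index.type", "BUCKET");
        return options;
    }
}
```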

Tests

HudiLakeCatalogTest (14 test cases):

testPropertyPrefixRewriting — verifies hudi.xxx → xxx and non-hudi → fluss.xxx prefix rewriting
testCreatePrimaryKeyTable — PK table (MOR) creation with system columns and primary key
testCreateLogTable — Log table (COW) creation with record key from customProperties
testIsHudiSchemaCompatibleWithSameSchema — compatible schemas return true
testIsHudiSchemaCompatibleWithDifferentColumnCount — different column count returns false
testIsHudiSchemaCompatibleWithDifferentColumnName — different column name returns false
testIsHudiSchemaCompatibleWithDifferentColumnType — different column type returns false
testCreateDuplicateTableWithCompatibleSchema — duplicate creation with compatible schema is idempotent
testCreateDuplicateTableWithIncompatibleSchema — duplicate creation with incompatible schema throws TableAlreadyExistException
testUnsettableOptionInPropertiesThrowsException — protected option in properties throws InvalidConfigException
testUnsettableOptionInCustomPropertiesThrowsException — protected option in customProperties throws InvalidConfigException
testNonProtectedHudiOptionPassesValidation — non-protected option passes validation
testSystemColumnBucketConflictThrowsException — __bucket conflict throws InvalidTableException
testSystemColumnOffsetConflictThrowsException — __offset conflict throws InvalidTableException
testSystemColumnTimestampConflictThrowsException — __timestamp conflict throws InvalidTableException

API and Format

No API or storage format changes. This PR only adds new implementations for the existing LakeCatalog and LakeStorage SPI interfaces.

Documentation

This is a new feature: Hudi lake catalog support for table creation. The Hudi integration guide will need documentation updates.

fhan688 · 2026-05-28T13:44:29Z

All tests pass. Please help review, thanks! @XuQianJin-Stars

XuQianJin-Stars · 2026-05-31T11:41:14Z

+        // Create table in Hudi catalog
+        try {
+            createTable(objectPath, catalogTable, context.isCreatingFlussTable());
+        } catch (DatabaseNotExistException e) {


The inner createTable throws org.apache.fluss.exception.TableAlreadyExistException (unchecked) on schema incompatibility, which is fine. The outer method signature declares throws org.apache.fluss.exception.TableAlreadyExistException, apparently to comply with the LakeCatalog.createTable contract, but because the exception is a RuntimeException subclass, the throws clause is documentation only and the compiler does not enforce it. That is not really a problem. However:

The real bug is in the test testCreateDuplicateTableWithIncompatibleSchema: based on the implementation of isHudiSchemaCompatible(existingTable, catalogTable) (with the ResolvedCatalogBaseTable vs Schema.UnresolvedColumn branches), the catalogTable passed into hudiCatalog.createTable() is a CatalogTable.of(...) (unresolved), while hudiCatalog.getTable() may return a ResolvedCatalogBaseTable. extractColumns will go through two different code paths (one uses column.getDataType() returning DataType, the other uses UnresolvedColumn returning AbstractDataType<?>). Comparison between AbstractDataType and DataType via equals is not necessarily symmetric — even with identical column names/types it may be judged as "incompatible", and conversely different schemas may fail comparison because the AbstractDataType is not fully resolved.

Suggestion: First normalize both existingTable and expectedTable into the same representation (recommended: compare using org.apache.flink.table.types.logical.LogicalType, or convert via getUnresolvedSchema() to Schema then compare on column name + AbstractDataType). The dataType comparison should use the LogicalType.equals dimension (stripping the conversion class). Paimon's approach with isPaimonSchemaCompatible — directly comparing name+type+nullable on Paimon Schema fields — is more robust.
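A minimal sketch of the normalization idea, assuming both tables are first reduced to the same list of column signatures before comparison. `SchemaCompatSketch` and its `ColumnSignature` are hypothetical, and the `logicalType` string is only a stand-in for a resolved Flink LogicalType (for example its serialized form, which includes nullability); no Flink classes are used here.

```java
import java.util.List;
import java.util.Objects;

/** Hypothetical sketch: compare schemas only after normalizing both sides. */
final class SchemaCompatSketch {
    /** One normalized column: name plus a logical-type string (stand-in for LogicalType). */
    static final class ColumnSignature {
        final String name;
        final String logicalType;

        ColumnSignature(String name, String logicalType) {
            this.name = name;
            this.logicalType = logicalType;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof ColumnSignature)) {
                return false;
            }
            ColumnSignature other = (ColumnSignature) o;
            return name.equals(other.name) && logicalType.equals(other.logicalType);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, logicalType);
        }
    }

    /** Schemas are compatible when column count, order, names, and types all match. */
    static boolean isCompatible(List<ColumnSignature> existing, List<ColumnSignature> expected) {
        return existing.equals(expected);
    }
}
```

Because both inputs share one representation, the comparison no longer depends on which code path (resolved or unresolved) produced the columns.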

XuQianJin-Stars · 2026-05-31T11:43:34Z

+                }
+                // if creating a new fluss table, we should ensure the lake table is empty
+                // TODO: add emptiness check for Hudi table once LakeTieringFactory is implemented
+                if (isCreatingFlussTable) {


Compare with PaimonLakeCatalog.createTable (line 180) which directly calls checkTableIsEmpty(...). Without the emptiness check, if a user attaches a new Fluss table to a Hudi table that already contains data, data corruption/mixing will occur. Even though TieringFactory is not yet ready, at the very least the semantics should be documented (in a comment or in the thrown exception), and consider throwing TableAlreadyExistException here as a safety fallback (safer), then loosening it once TieringFactory is ready. At minimum, do not silently warn and return success.
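The fail-safe fallback suggested here could look roughly like the sketch below. It is an assumption-laden illustration: `EmptinessGuardSketch`, `Emptiness`, and `checkAttachable` are hypothetical names, and `IllegalStateException` stands in for the Fluss exception (such as TableAlreadyExistException) that real code would throw.

```java
/** Hypothetical sketch: refuse to attach a new Fluss table unless the lake table is known to be empty. */
final class EmptinessGuardSketch {
    /** UNKNOWN models the current state, where no emptiness check is available yet. */
    enum Emptiness { EMPTY, NON_EMPTY, UNKNOWN }

    static void checkAttachable(String tablePath, boolean isCreatingFlussTable, Emptiness emptiness) {
        // Only a brand-new Fluss table needs the guarantee that the lake table holds no data.
        if (isCreatingFlussTable && emptiness != Emptiness.EMPTY) {
            // Stand-in exception; real code would use a Fluss exception type.
            throw new IllegalStateException(
                    "Lake table " + tablePath + " already exists and is not known to be empty ("
                            + emptiness + "); refusing to attach a new Fluss table to it.");
        }
    }
}
```

Treating UNKNOWN like NON_EMPTY is the conservative choice: the check can be loosened later once a real emptiness probe exists.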

XuQianJin-Stars · 2026-05-31T11:45:10Z

+                createTable(objectPath, catalogTable, context.isCreatingFlussTable());
+            } catch (DatabaseNotExistException t) {
+                // shouldn't happen in normal cases
+                throw new RuntimeException(


This throws a raw RuntimeException without passing t as the cause, losing the stack trace. Please change to new RuntimeException(msg, t). Similarly at line 144, the inner throw new RuntimeException(...) (schema fallback) also drops the cause (tableNotExistException) — please add it.
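For illustration, a small self-contained sketch of wrapping a checked exception while keeping its cause. `CauseChainingSketch` and `DatabaseMissingException` are hypothetical stand-ins for the catalog code and for DatabaseNotExistException.

```java
/** Hypothetical sketch: wrap a checked exception without losing its stack trace. */
final class CauseChainingSketch {
    /** Stand-in for a checked catalog exception such as DatabaseNotExistException. */
    static final class DatabaseMissingException extends Exception {
        DatabaseMissingException(String message) {
            super(message);
        }
    }

    /** Always fails, to simulate the "shouldn't happen" branch. */
    static void createTable(String database) throws DatabaseMissingException {
        throw new DatabaseMissingException("Database " + database + " does not exist.");
    }

    static void createTableUnchecked(String database) {
        try {
            createTable(database);
        } catch (DatabaseMissingException t) {
            // Passing t as the cause keeps the original stack trace in the wrapped exception.
            throw new RuntimeException("Failed to create table in database " + database, t);
        }
    }
}
```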

XuQianJin-Stars · 2026-05-31T11:47:17Z

+    public static final String FILE_SYSTEM_TYPE = "dfs";
+
+    public static Catalog createHudiCatalog(Configuration configuration) {
+        Map<String, String> hudiProps = configuration.toMap();


createHudiCatalog uses the custom MODE_CONFIG = "mode" read in HudiLakeCatalog, while buildHudiCatalog internally uses CatalogOptions.MODE.key() (also "mode"). The two strings match only by coincidence, and maintaining two sets of constants invites future drift. Standardize on CatalogOptions.MODE.key() as the authoritative key and drop the MODE_CONFIG constant. The same applies to CATALOG_NAME_CONFIG = "name": prefer a public Hudi constant.

XuQianJin-Stars · 2026-05-31T11:49:31Z

+/** Convert from Fluss's data type to Hudi's internal data type. */
+public class FlussDataTypeToHudiDataType implements DataTypeVisitor<DataType> {
+
+    private String catalogMode;


The catalogMode field is not declared final, yet the class exposes a public constructor and provides two singleton INSTANCEs. If any caller obtains DFS_INSTANCE and accidentally invokes a future mutator, concurrency issues arise. Please change the field to private final, and either remove the public constructor or change it to private (external code uses the two INSTANCEs uniformly).

The class-level Javadoc lacks the @ThreadSafe annotation; per spec (AGENTS.md §5), multiple threads will share the same INSTANCE, so this should be explicitly marked.
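A sketch of the immutable shape being requested, combined with the mode-specific LocalZonedTimestampType mapping from the PR description. `LtzMappingSketch` is a hypothetical stand-in for the converter and returns type names as plain strings instead of Flink DataType objects.

```java
/** Hypothetical sketch: immutable, shareable converter with two fixed instances. */
final class LtzMappingSketch {
    static final LtzMappingSketch HMS_INSTANCE = new LtzMappingSketch("hms");
    static final LtzMappingSketch DFS_INSTANCE = new LtzMappingSketch("dfs");

    // final: the mode cannot change after construction, so instances are safe to share.
    private final String catalogMode;

    // private: callers must use one of the two instances above.
    private LtzMappingSketch(String catalogMode) {
        this.catalogMode = catalogMode;
    }

    /** Local-zoned timestamps map to BIGINT under HMS and stay LTZ under DFS. */
    String visitLocalZonedTimestamp() {
        return "hms".equals(catalogMode) ? "BIGINT" : "TIMESTAMP_WITH_LOCAL_TIME_ZONE";
    }
}
```

With no mutable state, sharing an instance across threads needs no synchronization.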

XuQianJin-Stars · 2026-05-31T11:51:23Z

+        return recordKeyField;
+    }
+
+    private static void validateHudiOptions(Map<String, String> properties, boolean isPkTable) {


Logic: non-PK tables may set RECORD_KEY_FIELD (consumed by getRecordKeyField), while PK tables may not. The test testUnsettableOptionInCustomPropertiesThrowsException expects a PK table that passes hudi.hoodie.datasource.write.recordkey.field to fail, and this rule does work at runtime. The test testCreateLogTable (line 152) sets the same key on a non-PK table and expects success, which also looks OK. Readability issue: the two validation scenarios are easy to confuse. Suggest adding a Javadoc on validateHudiOptions explaining that RECORD_KEY_FIELD is a legal usage for non-PK tables (i.e., "log table = use the user-specified primary key field as the index field").

More importantly, after the function lets RECORD_KEY_FIELD pass, it doesn't validate that "non-PK tables must provide RECORD_KEY_FIELD"; that's done in buildHudiTableProperties via IllegalArgumentException("Record key field should be set."). IllegalArgumentException does not belong to the Fluss InvalidConfigException family (Section 4); the user sees a raw IAE. This should be changed to InvalidConfigException with helpful hints including tablePath/the specific key for easier troubleshooting.
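A possible shape for the friendlier error, sketched under assumptions: `RecordKeyValidationSketch` and `requireRecordKey` are hypothetical, the nested `InvalidConfigException` is a local stand-in for the Fluss exception of the same name, and the key is assumed to have already had its `hudi.` prefix stripped.

```java
import java.util.Map;

/** Hypothetical sketch: fail with a domain exception naming the table and the missing key. */
final class RecordKeyValidationSketch {
    static final String RECORD_KEY_FIELD = "hoodie.datasource.write.recordkey.field";

    /** Local stand-in for Fluss's InvalidConfigException. */
    static final class InvalidConfigException extends RuntimeException {
        InvalidConfigException(String message) {
            super(message);
        }
    }

    /** Returns the record key for a log table, or throws with a troubleshooting hint. */
    static String requireRecordKey(String tablePath, Map<String, String> properties) {
        String value = properties.get(RECORD_KEY_FIELD);
        if (value == null || value.trim().isEmpty()) {
            throw new InvalidConfigException(
                    String.format(
                            "Log table %s has no primary key, so '%s' must be set.",
                            tablePath, RECORD_KEY_FIELD));
        }
        return value;
    }
}
```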

XuQianJin-Stars · 2026-05-31T11:54:11Z

+        return columns;
+    }
+
+    private static AbstractDataType<?> getDataType(Schema.UnresolvedColumn column) {


getDataType does not handle UnresolvedComputedColumn and silently returns null. Schema.UnresolvedComputedColumn has no data type, and returning null causes ColumnSignature.equals to mistakenly treat different computed columns as equal. Hudi tables shouldn't have computed columns in theory, but a safer approach is to throw IllegalStateException("Unexpected column kind: " + column.getClass()).
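The fail-fast variant might be sketched as below. The column classes are simplified, hypothetical stand-ins for Flink's Schema.UnresolvedColumn hierarchy, with the data type reduced to a string.

```java
/** Hypothetical sketch: reject column kinds that carry no data type instead of returning null. */
final class ColumnKindSketch {
    /** Stand-in for Schema.UnresolvedColumn. */
    static class Column {
        final String name;

        Column(String name) {
            this.name = name;
        }
    }

    /** Stand-in for a physical column, which has a declared data type. */
    static final class PhysicalColumn extends Column {
        final String dataType;

        PhysicalColumn(String name, String dataType) {
            super(name);
            this.dataType = dataType;
        }
    }

    /** Stand-in for a computed column, which has an expression but no data type. */
    static final class ComputedColumn extends Column {
        final String expression;

        ComputedColumn(String name, String expression) {
            super(name);
            this.expression = expression;
        }
    }

    static String getDataType(Column column) {
        if (column instanceof PhysicalColumn) {
            return ((PhysicalColumn) column).dataType;
        }
        // Never return null here: null types would make distinct columns compare as equal.
        throw new IllegalStateException("Unexpected column kind: " + column.getClass());
    }
}
```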

XuQianJin-Stars · 2026-05-31T11:57:29Z

+
+        // Add regular columns
+        for (org.apache.fluss.metadata.Schema.Column column :
+                tableDescriptor.getSchema().getColumns()) {


SYSTEM_COLUMNS (defined at HudiLakeCatalog.java L66–72) adds __bucket / __offset / __timestamp, but Hudi itself reserves _hoodie_commit_time / _hoodie_record_key / _hoodie_partition_path, etc. In HudiConversions.createHudiCatalogTable (L105–119), only collisions against Fluss system columns are checked before writing fluss columns + system columns into the schema. Suggest adding a rejection rule for the _hoodie_ prefix (or a "reserved column" set parallel to HUDI_UNSETTABLE_OPTIONS) with corresponding tests.
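A rough sketch of the suggested rejection rule, checking user column names against both the Fluss system columns and the `_hoodie_` prefix. `ReservedColumnSketch` and its nested `InvalidTableException` are hypothetical stand-ins; the real check would live alongside the existing system-column validation.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: reject Fluss system columns and Hudi-reserved column names. */
final class ReservedColumnSketch {
    static final Set<String> SYSTEM_COLUMNS =
            new LinkedHashSet<>(Arrays.asList("__bucket", "__offset", "__timestamp"));
    static final String HOODIE_PREFIX = "_hoodie_";

    /** Local stand-in for Fluss's InvalidTableException. */
    static final class InvalidTableException extends RuntimeException {
        InvalidTableException(String message) {
            super(message);
        }
    }

    static void validateColumnNames(List<String> columnNames) {
        for (String name : columnNames) {
            if (SYSTEM_COLUMNS.contains(name)) {
                throw new InvalidTableException(
                        "Column '" + name + "' conflicts with a Fluss system column.");
            }
            if (name.startsWith(HOODIE_PREFIX)) {
                throw new InvalidTableException(
                        "Column '" + name + "' uses the reserved Hudi prefix '" + HOODIE_PREFIX + "'.");
            }
        }
    }
}
```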

Copilot

Pull request overview

This PR introduces Hudi lake catalog support so Fluss can create Hudi-backed lake tables (paralleling existing Paimon/Iceberg integrations), including schema conversion, option rewriting/validation, and Flink-side reflective loading to keep fluss-flink-common free of a compile-time Hudi dependency.

Changes:

Add new fluss-lake-hudi module implementing LakeCatalog/LakeStorage for Hudi table creation (schema + properties conversions, HMS/DFS modes, idempotent create).
Extend Flink integration to support HUDI in LakeFlinkCatalog (reflective catalog creation) and LakeTableFactory (reflective table factory loading).
Add unit tests for Hudi catalog behavior (properties rewriting, schema compatibility, idempotency, protected options, system-column conflicts).

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
fluss-lake/pom.xml	Adds Flink table API dependency management needed by the new Hudi module.
fluss-lake/fluss-lake-hudi/pom.xml	Introduces/adjusts Hudi + Flink/Hadoop dependencies and test utilities for the new module.
fluss-lake/fluss-lake-hudi/src/main/java/org/apache/fluss/lake/hudi/HudiLakeCatalog.java	New Hudi `LakeCatalog` implementation (create table, schema compatibility for idempotency, DB creation).
fluss-lake/fluss-lake-hudi/src/main/java/org/apache/fluss/lake/hudi/HudiLakeStorage.java	Wires `LakeStorage#createLakeCatalog()` to return the new `HudiLakeCatalog`.
fluss-lake/fluss-lake-hudi/src/main/java/org/apache/fluss/lake/hudi/FlussDataTypeToHudiDataType.java	Converts Fluss types to Flink/Hudi types, with mode-specific handling for LTZ timestamps.
fluss-lake/fluss-lake-hudi/src/main/java/org/apache/fluss/lake/hudi/utils/HudiConversions.java	Converts Fluss table descriptors to Hudi/Flink schema + options; validates protected Hudi options; rewrites property prefixes.
fluss-lake/fluss-lake-hudi/src/main/java/org/apache/fluss/lake/hudi/utils/catalog/HudiCatalogUtils.java	Builds Hudi Catalog instances for HMS/DFS modes and helpers to create catalog table wrappers.
fluss-lake/fluss-lake-hudi/src/test/java/org/apache/fluss/lake/hudi/HudiLakeCatalogTest.java	New unit test suite covering prefix rewriting, table creation, schema compatibility, idempotency, and validation.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/lake/LakeFlinkCatalog.java	Adds HUDI branch + reflection-based `HudiCatalogFactory` to load Hudi from plugin classloader.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/lake/LakeTableFactory.java	Adds HUDI branch and reflective `HoodieTableFactory` instantiation for table sources.

+        try {
+            Class<?> hudiFactoryClass = Class.forName("org.apache.hudi.table.HoodieTableFactory");
+            return (DynamicTableSourceFactory)
+                    hudiFactoryClass.getDeclaredConstructor().newInstance();
+        } catch (Exception e) {
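For context on the snippet above, reflective loading through the thread context classloader can be sketched in isolation as follows. `ReflectiveLoadSketch` and `instantiate` are hypothetical names, and the usage below loads `java.util.ArrayList` purely as a stand-in for a plugin class such as HoodieTableFactory.

```java
/** Hypothetical sketch: instantiate a class by name via the thread context classloader. */
final class ReflectiveLoadSketch {
    static <T> T instantiate(String className, Class<T> expectedType) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Fall back to this class's loader when no context classloader is set.
            loader = ReflectiveLoadSketch.class.getClassLoader();
        }
        try {
            Class<?> clazz = Class.forName(className, true, loader);
            return expectedType.cast(clazz.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException e) {
            // Keep the cause so a missing bundle is easy to diagnose.
            throw new IllegalStateException(
                    "Failed to load " + className + "; is the bundle on the classpath?", e);
        }
    }
}
```

For example, `ReflectiveLoadSketch.instantiate("java.util.ArrayList", java.util.List.class)` returns a new list, while an unknown class name surfaces as an IllegalStateException whose cause is the ClassNotFoundException.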


+        LOG.info(
+                "create hudi catalog: {}, mode: {}, configuration: {}",
+                catalogName,
+                catalogMode,
+                configuration);


+        List<String> partitionKeys = tableDescriptor.getPartitionKeys();
+        Map<String, String> options =
+                buildHudiTableProperties(tablePath, tableDescriptor, isPkTable);
+        LOG.info("Hudi table properties: {}", options);


fhan688 added 11 commits May 26, 2026 20:35

[lake/hudi] Introduce Hudi LakeCatalog to create table

9ab5b98

[lake/hudi] fix HIVE_META_STORE_TYPE spelling error

0a21ea0

[lake/hudi] add catalogMode in HudiConversions.convertToFlinkResolved…

c5594be

…Schema() and add isCreatingFlussTable in HudiLakeCatalog.createTable()

[lake/hudi] refine primarykey logic in HudiConversions & use flink Ca…

ca2da58

…talogDatabaseImpl in HudiLakeCatalog & use copied Configuration in HudiCatalogUtils

[lake/hudi] optimize createDatabase() impl

9a12957

[lake/hudi] add some basic tests

47dc7c5

[lake/hudi] add version info for flink-table-api-java dependency

233a8bd

[lake/hudi] fix format violations

1c7abb9

[lake/hudi] fix build errors

5d7d654

[lake/hudi] fix build errors in HudiLakeCatalogTest

05161e2

[lake/hudi] fix test failures in HudiLakeCatalogTest

f8696f1

XuQianJin-Stars reviewed May 31, 2026

fhan688 added 2 commits May 31, 2026 20:48

[lake/hudi] optimize HudiLakeCatalog

838b840

[lake/hudi] optimize HudiConversions & FlussDataTypeToHudiDataType

6ac6f0e

fhan688 closed this Jun 1, 2026

fhan688 reopened this Jun 1, 2026

fhan688 closed this Jun 1, 2026

fhan688 reopened this Jun 1, 2026

luoyuxia requested a review from Copilot June 1, 2026 08:25

Copilot started reviewing on behalf of luoyuxia June 1, 2026 08:25

Copilot AI reviewed Jun 1, 2026

fhan688 and others added 4 commits June 1, 2026 18:49

[lake/hudi] Support unprefixed Hudi record key option

427e886

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

[lake/hudi] avoid logging full Hudi catalog config & lower Hudi table…

9096129

… options log to debug;[flink] load Hudi table factory with context classloader

Merge remote-tracking branch 'fhan688/Introduce-hudi-LakeCatalog-to-c…

6263ce5

…reate-table' into Introduce-hudi-LakeCatalog-to-create-table

[lake/hudi] fix format violations

4fd6e3a

fhan688 closed this Jun 1, 2026

fhan688 reopened this Jun 1, 2026

fhan688 wants to merge 17 commits into apache:main from fhan688:Introduce-hudi-LakeCatalog-to-create-table

3 participants
