MDEV-39499 Updates to derived-with-keys, window functions determining#5252
MDEV-39499 Updates to derived-with-keys, window functions determining#5252mariadb-RexJohnston wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces cardinality estimation for derived tables with window functions (specifically ROW_NUMBER) by adding opt_window_function_cardinality.cc and integrating it into temporary key creation. Feedback on these changes highlights critical compilation errors due to type mismatches between ulong* and rec_per_key_t* (which is double*) in the new estimation function. Additionally, several high-severity issues were identified, including potential null pointer dereferences when resolving fields and a potential buffer overflow in TABLE::add_tmp_key if key_parts exceeds MAX_REF_PARTS.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
..records per key Enabling derived keys optimization for derived.col = const pushed conditions. Estimating records per key in derived key for the optimizer based on form and/or size of components of a derived table. Consider any derived table of the form SELECT ..., ROW_NUMBER () OVER (PARTITION BY c1,c2 order by ...) FROM t1, t2, t3 ... WHERE ... If the optimizer generates a key on this derived table because of a constraint being pushed into it, it currently will not consider key components of the form col = const We lift this constraint and add code to TABLE::add_tmp_key to search for a window function ROW_NUMBER(). From the partition list c1, c2 we can in infer an estimate of the number of rows we expect to see for each key value. In the absence of EITS, we still have an number of expected records in our referred to table and will fall back to using that as a worst case scenario. As a consequence of forcing best_access_path to now consider a generated index's records per key on any type of derived table, we see a number of optimization changes, for example a range optimizer might now be considered best and might return IMPOSSIBLE_RANGE (seen in rowid_filter_myisam.result).
c4c2f7d to
f824984
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to estimate the cardinality of derived tables with window functions (specifically ROW_NUMBER() OVER (PARTITION BY ...)), allowing the optimizer to better determine records per key. It adds opt_window_function_cardinality.cc and opt_window_function_cardinality.h, integrates them into the build files and TABLE::add_tmp_key, and includes extensive test coverage. The review feedback highlights several critical improvements: 1) In handle_single_part_rownumber, HA_POS_ERROR must be handled safely to prevent massive values from casting, and multi-column partition estimates should be capped by the table's total records. 2) In est_derived_window_fn_cardinality, comparing result_field directly with the key field is a much more robust heuristic than using select list indices, and rec_per_key_t should be used instead of ulong for type safety. 3) A defensive assertion should be added in TABLE::add_tmp_key to prevent potential stack buffer overflows.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| static ha_rows handle_single_part_rownumber( Item_window_func *item_row_number) | ||
| { | ||
| if (item_row_number->window_spec->partition_list && | ||
| item_row_number->window_spec->partition_list->elements) | ||
| { | ||
| double records= 1; | ||
| ORDER *part_list= item_row_number->window_spec->partition_list->first; | ||
| for (; part_list; part_list= part_list->next) | ||
| { | ||
| if ((*part_list->item)->type() == Item::FIELD_ITEM) | ||
| { | ||
| Item_field *item_field= ((Item_field*)(*part_list->item)); | ||
| DBUG_ASSERT(item_field); | ||
| Field *field= item_field->field; | ||
| DBUG_ASSERT(field); | ||
| /* | ||
| use EITS records per key if we have them, | ||
| otherwise | ||
| use number of records in the table (as a worst case scenario) | ||
| */ | ||
| if (field->read_stats) | ||
| { | ||
| // records*= field->read_stats->get_avg_frequency(); | ||
| double freq= field->read_stats->get_avg_frequency(); | ||
| records*= freq > 0 ? field->table->used_stat_records / freq | ||
| : (double)field->table->used_stat_records; | ||
| } | ||
| else | ||
| records*= (double)field->table->used_stat_records; | ||
| } | ||
| } | ||
| return records < 1.0 ? (ha_rows)1 : (ha_rows)records; | ||
| } | ||
| else | ||
| { | ||
| return (ha_rows)1; | ||
| } | ||
| } |
There was a problem hiding this comment.
The current implementation of handle_single_part_rownumber has two significant issues:
- Handling of
HA_POS_ERROR: Iftable->used_stat_recordsisHA_POS_ERROR(indicating unknown/uninitialized statistics), casting it todoublewill result in a huge value (approx1.8e19), which will completely distort the cardinality estimation and potentially cause overflows. - Exaggerated Estimates for Multi-column Partitions: When partitioning by multiple columns of the same table, multiplying their distinct values can easily exceed the total number of records in the table. The accumulated distinct values for each table should be capped by that table's total records.
We can resolve both issues by tracking distinct values per table and capping them, while safely falling back to transient engine statistics or a default value if used_stat_records is HA_POS_ERROR.
static ha_rows handle_single_part_rownumber(Item_window_func *item_row_number)
{
if (item_row_number->window_spec->partition_list &&
item_row_number->window_spec->partition_list->elements)
{
double records= 1;
const uint MAX_TABLES= 16;
TABLE *tables[MAX_TABLES];
double table_distinct[MAX_TABLES];
uint table_count= 0;
ORDER *part_list= item_row_number->window_spec->partition_list->first;
for (; part_list; part_list= part_list->next)
{
if ((*part_list->item)->type() == Item::FIELD_ITEM)
{
Item_field *item_field= ((Item_field*)(*part_list->item));
DBUG_ASSERT(item_field);
Field *field= item_field->field;
DBUG_ASSERT(field);
TABLE *table= field->table;
ha_rows table_records= table->used_stat_records;
if (table_records == HA_POS_ERROR)
{
table_records= table->file->stats.records;
if (table_records == HA_POS_ERROR)
table_records= 1000; // Fallback default
}
double diff_values= 1.0;
if (field->read_stats)
{
double freq= field->read_stats->get_avg_frequency();
diff_values= freq > 0 ? (double)table_records / freq
: (double)table_records;
}
else
{
diff_values= (double)table_records;
}
uint idx= 0;
for (; idx < table_count; idx++)
{
if (tables[idx] == table)
break;
}
if (idx < MAX_TABLES)
{
if (idx == table_count)
{
tables[idx]= table;
table_distinct[idx]= diff_values;
table_count++;
}
else
{
table_distinct[idx]*= diff_values;
}
if (table_distinct[idx] > (double)table_records)
table_distinct[idx]= (double)table_records;
}
}
}
for (uint i= 0; i < table_count; i++)
{
records*= table_distinct[i];
}
return records < 1.0 ? (ha_rows)1 : (ha_rows)records;
}
else
{
return (ha_rows)1;
}
}| bool est_derived_window_fn_cardinality(st_select_lex* derived, | ||
| ulong *out_records, | ||
| Field **reg_fields, | ||
| uint key_parts) | ||
| { | ||
| uint fc= 0; | ||
|
|
||
| while (fc < key_parts) | ||
| { | ||
| uint ic= 0; | ||
| if (!reg_fields[fc]) | ||
| { | ||
| fc++; | ||
| continue; | ||
| } | ||
| uint field_index= reg_fields[fc]->field_index; | ||
| List_iterator_fast<Item> li(*derived->get_item_list()); | ||
| Item *item; | ||
|
|
||
| // find our window function in the item list | ||
| while ((item= li++)) | ||
| { | ||
| // if the single part key is on our window function | ||
| if ((ic == field_index) && item->type() == Item::WINDOW_FUNC_ITEM) | ||
| { | ||
| Item_window_func *itm_row_num= (Item_window_func*)item; | ||
|
|
||
| if (itm_row_num->window_func()->sum_func() == Item_sum::ROW_NUMBER_FUNC) | ||
| { | ||
| *out_records= (ulong)handle_single_part_rownumber(itm_row_num); | ||
| return true; | ||
| } | ||
| } | ||
| ic++; | ||
| } | ||
| fc++; | ||
| } | ||
|
|
||
| return false; | ||
| } |
There was a problem hiding this comment.
The current implementation of est_derived_window_fn_cardinality relies on comparing the select list index ic with the temporary table's field_index. This index-based heuristic is fragile and can break if fields are optimized away, reordered, or if hidden fields are present in the temporary table.
Instead, we can directly compare the result_field of the Item with the key field reg_fields[fc]. This is 100% robust and standard in the optimizer. Additionally, we should use rec_per_key_t instead of ulong for out_records to ensure type safety and future-proofing.
bool est_derived_window_fn_cardinality(st_select_lex* derived,
rec_per_key_t *out_records,
Field **reg_fields,
uint key_parts)
{
uint fc= 0;
while (fc < key_parts)
{
if (!reg_fields[fc])
{
fc++;
continue;
}
List_iterator_fast<Item> li(*derived->get_item_list());
Item *item;
// find our window function in the item list
while ((item= li++))
{
// Compare the result_field of the item directly with the key field
if (item->result_field == reg_fields[fc] && item->type() == Item::WINDOW_FUNC_ITEM)
{
Item_window_func *itm_row_num= (Item_window_func*)item;
if (itm_row_num->window_func()->sum_func() == Item_sum::ROW_NUMBER_FUNC)
{
*out_records= (rec_per_key_t)handle_single_part_rownumber(itm_row_num);
return true;
}
}
}
fc++;
}
return false;
}| bool est_derived_window_fn_cardinality(st_select_lex* derived, | ||
| ulong *out_records, | ||
| Field **reg_fields, | ||
| uint key_parts); |
There was a problem hiding this comment.
Update the function signature to use rec_per_key_t instead of ulong for the out_records parameter to maintain type safety and consistency with the optimizer's rec_per_key array.
| bool est_derived_window_fn_cardinality(st_select_lex* derived, | |
| ulong *out_records, | |
| Field **reg_fields, | |
| uint key_parts); | |
| bool est_derived_window_fn_cardinality(st_select_lex* derived, | |
| rec_per_key_t *out_records, | |
| Field **reg_fields, | |
| uint key_parts); |
| char buf[NAME_CHAR_LEN]; | ||
| KEY *keyinfo= key_info + key; | ||
| KEY_PART_INFO *key_part_info; | ||
| Field **reg_field; | ||
| Field *reg_field[MAX_REF_PARTS]; | ||
| uint i; | ||
| bool key_start= TRUE; |
There was a problem hiding this comment.
Since reg_field is now declared as a fixed-size stack array of size MAX_REF_PARTS, we should add a defensive assertion DBUG_ASSERT(key_parts <= MAX_REF_PARTS); at the beginning of TABLE::add_tmp_key to prevent any potential stack buffer overflow if key_parts exceeds MAX_REF_PARTS.
char buf[NAME_CHAR_LEN];
KEY *keyinfo= key_info + key;
KEY_PART_INFO *key_part_info;
Field *reg_field[MAX_REF_PARTS];
uint i;
bool key_start= TRUE;
DBUG_ASSERT(key_parts <= MAX_REF_PARTS);
..records per key
Enabling derived keys optimization for derived.col = const pushed conditions.
Estimating records per key in derived key for the optimizer based on form and/or size of components of a derived table.
Consider any derived table of the form
SELECT ..., ROW_NUMBER () OVER (PARTITION BY c1,c2 order by ...) FROM t1, t2, t3 ...
WHERE ...
If the optimizer generates a key on this derived table because of a constraint being pushed into it, it currently will not consider key components of the form col = const
We lift this constraint and add code to TABLE::add_tmp_key to search for a window function ROW_NUMBER(). From the partition list c1, c2 we can in infer an estimate of the number of rows we expect to see for each key value. In the absence of EITS, we still have an number of expected records in our referred to table and will fall back to using that as a worst case scenario.
As a consequence of forcing best_access_path to now consider a generated index's records per key on any type of derived table, we see a number of optimization changes, for example a range optimizer might now be considered best and might return IMPOSSIBLE_RANGE (seen in rowid_filter_myisam.result).