load_feature_definitions_from_dataframe() doesn't recognize pandas nullable dtypes (Float64, Int64) #5675

@brifordwylie

PySDK Version
PySDK 3.6.0

Describe the bug
load_feature_definitions_from_dataframe() in sagemaker.mlops.feature_store only recognizes numpy dtypes (float64, int64, etc.) but not pandas nullable dtypes (Float64, Int64, string). When a DataFrame uses nullable dtypes (common after calling pd.DataFrame.convert_dtypes()), all numeric columns are incorrectly mapped to StringFeatureDefinition.

To reproduce

import pandas as pd
from sagemaker.mlops.feature_store import load_feature_definitions_from_dataframe

# Create a DataFrame with numpy dtypes (works correctly)
df_numpy = pd.DataFrame({
    "id": [1, 2, 3],
    "price": [1.1, 2.2, 3.3],
    "name": ["a", "b", "c"],
})
print("numpy dtypes:", {c: str(df_numpy[c].dtype) for c in df_numpy.columns})
# {'id': 'int64', 'price': 'float64', 'name': 'object'}

defs = load_feature_definitions_from_dataframe(df_numpy)
for d in defs:
    print(f"  {d.feature_name}: {d.feature_type}")

# Now convert to pandas nullable dtypes (common pattern)
df_nullable = df_numpy.convert_dtypes()
print("\nnullable dtypes:", {c: str(df_nullable[c].dtype) for c in df_nullable.columns})
# {'id': 'Int64', 'price': 'Float64', 'name': 'string'}

defs = load_feature_definitions_from_dataframe(df_nullable)
for d in defs:
    print(f"  {d.feature_name}: {d.feature_type}")

Root cause

In sagemaker/mlops/feature_store/feature_utils.py, _INTEGER_TYPES and _FLOAT_TYPES only contain lowercase numpy dtype names:

INTEGER_TYPES = {'int8', 'int16', 'int32', 'int64', 'int', 'uint8', 'uint16', 'uint32', 'uint64'}
FLOAT_TYPES = {'float16', 'float32', 'float64', 'float'}

Pandas nullable dtypes are capitalized (Int64, Float64, etc.) and are not matched.
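The mismatch can be illustrated without the SDK. This is a minimal sketch that assumes the lookup compares `str(dtype)` against the sets quoted above; the two sets are copied from this issue, not imported from the library, and the variable names are only for illustration.

```python
import pandas as pd

# Copied from the issue text above (assumption: the SDK matches on str(dtype)).
INTEGER_TYPES = {"int8", "int16", "int32", "int64", "int",
                 "uint8", "uint16", "uint32", "uint64"}
FLOAT_TYPES = {"float16", "float32", "float64", "float"}

# Explicit dtypes so the example does not depend on platform defaults.
numpy_ids = pd.Series([1, 2, 3], dtype="int64")
nullable_ids = pd.Series([1, 2, 3], dtype="Int64")
nullable_prices = pd.Series([1.1, 2.2, 3.3], dtype="Float64")

# The numpy dtype name is lowercase and is found in the set.
numpy_hit = str(numpy_ids.dtype) in INTEGER_TYPES

# The nullable dtype names are capitalized, so the membership tests fail.
nullable_int_hit = str(nullable_ids.dtype) in INTEGER_TYPES
nullable_float_hit = str(nullable_prices.dtype) in FLOAT_TYPES

print(numpy_hit, nullable_int_hit, nullable_float_hit)
```

Any column whose name misses both sets would then fall through to the string case, which matches the behavior reported above.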

Suggested fix

Add nullable dtype names to the type sets:

INTEGER_TYPES = {'int8', 'int16', 'int32', 'int64', 'int',
                 'Int8', 'Int16', 'Int32', 'Int64',
                 'uint8', 'uint16', 'uint32', 'uint64',
                 'UInt8', 'UInt16', 'UInt32', 'UInt64'}
FLOAT_TYPES = {'float16', 'float32', 'float64', 'float',
               'Float16', 'Float32', 'Float64'}

Or use case-insensitive comparison in _generate_feature_definition().
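A rough sketch of the case-insensitive option follows. `classify_dtype` and the labels it returns are hypothetical stand-ins, not the SDK's `_generate_feature_definition()`, and the sets are the original lowercase ones quoted in this issue.

```python
import pandas as pd

# Original lowercase sets, copied from the issue text.
INTEGER_TYPES = {"int8", "int16", "int32", "int64", "int",
                 "uint8", "uint16", "uint32", "uint64"}
FLOAT_TYPES = {"float16", "float32", "float64", "float"}


def classify_dtype(dtype) -> str:
    """Hypothetical helper: map a pandas dtype to a coarse feature kind."""
    # Lowercasing lets 'Int64' match 'int64' and 'Float64' match 'float64'.
    name = str(dtype).lower()
    if name in INTEGER_TYPES:
        return "Integral"
    if name in FLOAT_TYPES:
        return "Fractional"
    # Everything else falls back to string.
    return "String"


df = pd.DataFrame({
    "id": pd.Series([1, 2, 3], dtype="Int64"),
    "price": pd.Series([1.1, 2.2, 3.3], dtype="Float64"),
    "name": pd.Series(["a", "b", "c"], dtype="string"),
})

kinds = {col: classify_dtype(df[col].dtype) for col in df.columns}
print(kinds)
```

Lowercasing keeps the sets unchanged and also covers the unsigned nullable names (`UInt8` and so on), at the cost of matching any other dtype name that differs only by case.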

Expected behavior
Pandas nullable dtypes (Int64, Float64, etc.) should be mapped to the corresponding numeric feature definitions instead of StringFeatureDefinition.

System information
  • SageMaker Python SDK version: 3.6.0

I think this got fixed/addressed before, but maybe that 2.x code didn't carry over to 3.x:
https://github.com/aws/sagemaker-python-sdk/pull/3740/changes