Yet more methods of converting a pandas.DataFrame to numpy.array while preserving label/column names
This is mainly for demonstrating how to set dtype/column_dtypes, because sometimes a data source iterator's output will need some pre-normalization.
Method one inserts by column into a zeroed array of predefined height and is loosely based on a Creating Structured Arrays guide that a bit of web-crawling turned up:
import numpy

def to_tensor(dataframe, columns = None, dtypes = None):
    # Avoid mutable default arguments; no `dtypes` means no overrides
    dtypes = dtypes or {}
    # Use all columns from data frame if none were listed when called
    if columns is None or len(columns) <= 0:
        columns = dataframe.columns
    # Use a plain list of names for the numpy dtype specification
    columns = list(columns)
    # Build list of dtypes to use, updating from any `dtypes` passed when called
    dtype_list = []
    for column in columns:
        if column not in dtypes.keys():
            dtype_list.append(dataframe[column].dtype)
        else:
            dtype_list.append(dtypes[column])
    # Build dictionary with lists of column names and formatting in the same order
    dtype_dict = {
        'names': columns,
        'formats': dtype_list
    }
    # Initialize zero-filled numpy array with column names and formatting
    numpy_buffer = numpy.zeros(
        shape = len(dataframe),
        dtype = dtype_dict)
    # Insert values from dataframe columns into the matching numpy fields
    for column in columns:
        numpy_buffer[column] = dataframe[column].to_numpy()
    # Return results of conversion
    return numpy_buffer
Method two is based on user7138814's answer and will likely be more efficient, as it is basically a wrapper for the built-in to_records method available to pandas.DataFrames:
def to_tensor(dataframe, columns = None, dtypes = None, index = False):
    to_records_kwargs = {'index': index}
    if columns is None or len(columns) <= 0:  # Default to all `dataframe.columns`
        columns = dataframe.columns
    if dtypes:  # Pull in modifications only for dtypes listed in `columns`
        to_records_kwargs['column_dtypes'] = {}
        for column in dtypes.keys():
            if column in columns:
                to_records_kwargs['column_dtypes'][column] = dtypes[column]
    return dataframe[columns].to_records(**to_records_kwargs)
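Since method two mostly forwards its arguments, it may help to see what to_records does when called directly. The following is only a minimal sketch; the variable names records and with_index are made up for illustration.

```python
import pandas

X = pandas.DataFrame(dict(age=[40., 50., 60.],
                          sys_blood_pressure=[140., 150., 160.]))

# `column_dtypes` overrides the dtype of the listed columns only
records = X.to_records(index=False, column_dtypes={'age': 'int32'})
print(type(records).__name__)   # recarray
print(records.dtype.names)      # ('age', 'sys_blood_pressure')

# With `index=True` the data frame index becomes the first field
with_index = X.to_records(index=True)
print(with_index.dtype.names)   # ('index', 'age', 'sys_blood_pressure')
```

The result is a numpy.recarray, so fields can be read either as records['age'] or as records.age.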
With either of the above one could do...
import pandas

X = pandas.DataFrame(dict(age = [40., 50., 60.], sys_blood_pressure = [140., 150., 160.]))
# Example of overwriting dtype for a column
X_tensor = to_tensor(X, dtypes = {'age': 'int32'})
print("Ages -> {0}".format(X_tensor['age']))
print("SBPs -> {0}".format(X_tensor['sys_blood_pressure']))
... which should output...
Ages -> [40 50 60]
SBPs -> [140. 150. 160.]
... and a full dump of X_tensor should look like the following (method two returns a numpy.recarray, so its dump starts with rec.array instead of array).
array([(40, 140.), (50, 150.), (60, 160.)],
      dtype=[('age', '<i4'), ('sys_blood_pressure', '<f8')])
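For anyone new to structured arrays, here is a small sketch of how such a result can be read back by field or by row, and flattened into a plain 2-D array when the labels are no longer needed. It rebuilds the array by hand from the dump above; first_row and plain are just illustrative names.

```python
import numpy
from numpy.lib import recfunctions

X_tensor = numpy.array(
    [(40, 140.), (50, 150.), (60, 160.)],
    dtype=[('age', '<i4'), ('sys_blood_pressure', '<f8')])

# Field access returns one column, row access returns one record
ages = X_tensor['age']
first_row = X_tensor[0]
print(first_row['sys_blood_pressure'])  # 140.0

# Drop the labels; mixed field dtypes are promoted to a common dtype
plain = recfunctions.structured_to_unstructured(X_tensor)
print(plain.shape)  # (3, 2)
```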
Some thoughts
While method two will likely be more efficient than the first, method one (with some modifications) may be more useful for merging two or more pandas.DataFrames into one numpy.array.
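As a rough sketch of the kind of modification meant here, the column-by-column insert from method one can loop over several data frames instead of one. This assumes all frames have the same number of rows and no clashing column names; frames_to_tensor is a hypothetical helper name, not something from pandas or numpy.

```python
import numpy
import pandas


def frames_to_tensor(dataframes, dtypes=None):
    """Copy the columns of several equal-length data frames into one structured array."""
    dtypes = dtypes or {}
    heights = {len(dataframe) for dataframe in dataframes}
    if len(heights) != 1:
        raise ValueError("expected one or more data frames of equal length")

    names, formats, sources = [], [], []
    for dataframe in dataframes:
        for column in dataframe.columns:
            if column in names:
                raise ValueError("duplicate column name: {0}".format(column))
            names.append(column)
            # Fall back to the column's own dtype unless overridden
            formats.append(dtypes.get(column, dataframe[column].dtype))
            sources.append(dataframe[column])

    # Zeroed buffer of predefined height, as in method one
    numpy_buffer = numpy.zeros(
        shape=heights.pop(),
        dtype={'names': names, 'formats': formats})
    for name, series in zip(names, sources):
        numpy_buffer[name] = series.to_numpy()
    return numpy_buffer


X = pandas.DataFrame(dict(age=[40., 50., 60.]))
Y = pandas.DataFrame(dict(sys_blood_pressure=[140., 150., 160.]))
XY_tensor = frames_to_tensor([X, Y], dtypes={'age': 'int32'})
print(XY_tensor.dtype.names)  # ('age', 'sys_blood_pressure')
```

Rows are matched purely by position here, so any index alignment between the frames would need to be handled before calling it.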
Additionally (after swinging back through to review), the worry about face-planting with errors about to_records_kwargs not being a mapping applies to method two rather than method one, since only method two uses to_records_kwargs. As written above it always starts out as a dict holding index, so calling without dtypes should work and no else condition should be needed.