Data Processing with Optimus by Dr. Argenis Leon & Luis Aguirre

Author: Dr. Argenis Leon & Luis Aguirre
Language: eng
Format: epub
Publisher: Packt Publishing
Published: 2021-09-02T16:00:00+00:00


For a more general insight into the data, you can ask for a complete profile of the dataset. Let's check that out.

Data profiling

There is a handy function in Optimus called profile that returns useful stats about our dataset. Let's see how to use it:

df.profile(bins=5)

This code will return a dictionary:

{'columns': {
    'id': {
        'stats': {
            'match': 504,
            'missing': 0,
            'mismatch': 0,
            'profiler_dtype': {'dtype': 'int', 'categorical': True},
            'frequency': [
                {'value': 1, 'count': 1},
                {'value': 332, 'count': 1},
                {'value': 345, 'count': 1},
                {'value': 344, 'count': 1},
                {'value': 343, 'count': 1}],
            'count_uniques': 504},
        'dtype': 'int64'},
    'name': {
        'stats': {
            'match': 504,
            'missing': 0,
            'mismatch': 0,
            'profiler_dtype': {'dtype': 'str', 'categorical': True},
            'frequency': [
                {'value': 'pants', 'count': 254},
                {'value': 'shoes', 'count': 134},
                {'value': 'shirt', 'count': 116}],
            'count_uniques': 3},
        'dtype': 'object'},
    'code': {
        'stats': {
            'match': 504,
            'missing': 0,
            'mismatch': 0,
            'profiler_dtype': {'dtype': 'str', 'categorical': True},
            'frequency': [
                {'value': 'JG15', 'count': 60},
                {'value': 'JG10', 'count': 43},
                {'value': 'SK', 'count': 37},
                {'value': 'L15', 'count': 33},
                {'value': 'J15', 'count': 32}],
            'count_uniques': 39},
        'dtype': 'object'},
    'price': {
        'stats': {
            'match': 504,
            'missing': 0,
            'mismatch': 0,
            'profiler_dtype': {'dtype': 'decimal', 'categorical': False},
            'hist': [
                {'lower': 5.0, 'upper': 103.3675, 'count': 250},
                {'lower': 103.3675, 'upper': 201.735, 'count': 179},
                {'lower': 201.735, 'upper': 300.1025, 'count': 39},
                {'lower': 300.1025, 'upper': 398.47, 'count': 36}]},
        'dtype': 'float64'},
    'discount': {
        'stats': {
            'match': 294,
            'missing': 0,
            'mismatch': 210,
            'profiler_dtype': {'dtype': 'int', 'categorical': True},
            'frequency': [
                {'value': '0', 'count': 294},
                {'value': '5%', 'count': 65},
                {'value': '20%', 'count': 63},
                {'value': '15%', 'count': 54},
                {'value': '50%', 'count': 16}],
            'count_uniques': 6},
        'dtype': 'object'}},
 'name': 'store.csv',
 'file_name': ['store.csv'],
 'summary': {
    'cols_count': 5,
    'rows_count': 504,
    'dtypes_list': ['float64', 'int64', 'object'],
    'total_count_dtypes': 3,
    'missing_count': 0,
    'p_missing': 0.0}
}

With this Python dictionary, you can get info about specific columns and stats about the whole dataframe.
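
As a rough illustration of reading per-column information out of that dictionary, the sketch below uses only plain Python on a trimmed, hand-copied version of the output shown above, so it does not call Optimus at all. The helper name columns_with_mismatches is hypothetical and not part of the library.

```python
# Trimmed copy of the profile output shown above; only the keys used here are kept.
profile = {
    "columns": {
        "id": {"stats": {"match": 504, "missing": 0, "mismatch": 0}},
        "name": {"stats": {"match": 504, "missing": 0, "mismatch": 0}},
        "code": {"stats": {"match": 504, "missing": 0, "mismatch": 0}},
        "price": {"stats": {"match": 504, "missing": 0, "mismatch": 0}},
        "discount": {
            "stats": {
                "match": 294,
                "missing": 0,
                "mismatch": 210,
                "frequency": [
                    {"value": "0", "count": 294},
                    {"value": "5%", "count": 65},
                    {"value": "20%", "count": 63},
                    {"value": "15%", "count": 54},
                    {"value": "50%", "count": 16},
                ],
            }
        },
    }
}


def columns_with_mismatches(profile):
    """Hypothetical helper: map each column with mismatches to its mismatch count."""
    return {
        name: info["stats"]["mismatch"]
        for name, info in profile["columns"].items()
        if info["stats"]["mismatch"] > 0
    }


flagged = columns_with_mismatches(profile)
print(flagged)

# Look at the frequent values of the flagged column that are not plain digits.
discount_values = profile["columns"]["discount"]["stats"]["frequency"]
non_numeric = [item["value"] for item in discount_values if not item["value"].isdigit()]
print(non_numeric)
```

In this sample only the discount column is flagged, and its frequent non-numeric values all carry a percent sign, which is consistent with the profiler inferring an int type for the column and counting those entries as mismatches.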

For dataframe stats, you can use profile.summary() to get the following:

cols_count: Number of columns in the dataframe

rows_count: Number of rows in the dataframe

dtypes_list: List of dtypes in the dataframe

total_count_dtypes: Number of distinct data types in the dataframe

missing_count: Number of missing values in the dataframe

p_missing: Percentage of missing values in the dataframe
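
To make these keys concrete, here is a minimal sketch that works on a hand-copied version of the 'summary' entry from the output shown earlier. It is plain Python and does not call Optimus; describe_summary is a hypothetical helper introduced only for illustration.

```python
# Copy of the 'summary' entry from the profile output shown earlier.
summary = {
    "cols_count": 5,
    "rows_count": 504,
    "dtypes_list": ["float64", "int64", "object"],
    "total_count_dtypes": 3,
    "missing_count": 0,
    "p_missing": 0.0,
}


def describe_summary(summary):
    """Hypothetical helper: build a one-line description of the dataframe."""
    dtypes = ", ".join(summary["dtypes_list"])
    return (
        f"{summary['rows_count']} rows x {summary['cols_count']} columns, "
        f"{summary['total_count_dtypes']} dtypes ({dtypes}), "
        f"{summary['missing_count']} missing values"
    )


# In this sample, total_count_dtypes matches the length of dtypes_list.
dtype_counts_agree = summary["total_count_dtypes"] == len(summary["dtypes_list"])

line = describe_summary(summary)
print(line)
```

A short line like this can be handy in logs or notebooks when you only need the shape and overall completeness of the dataset and not the per-column details.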


