Consider the student exam scores scenario, which contains some outliers.
IQR method
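This method identifies outliers based on the quartiles of the data: by convention, a value is flagged if it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1 is the interquartile range.
Because quartiles barely move when a few extreme values are present, it is more resistant to outliers than mean-based methods such as the Z-score, which makes it better suited to skewed data like exam scores, where a few high scores can significantly shift the mean.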
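As a rough illustration of the IQR rule described above, here is a minimal sketch. The function name `remove_outliers_iqr`, the multiplier `k=1.5`, and the sample `scores` data are assumptions made up for this example; they are not part of the original analysis.

```python
import pandas as pd


def remove_outliers_iqr(df, k=1.5):
    """Drop rows where any numeric column lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # True wherever a value falls outside the fences of its own column
    outside = numeric.lt(lower, axis=1) | numeric.gt(upper, axis=1)
    return df[~outside.any(axis=1)]


# Hypothetical exam scores, invented for illustration only
scores = pd.DataFrame({"score": [55, 60, 62, 65, 68, 70, 72, 75, 78, 100]})
iqr_cleaned = remove_outliers_iqr(scores)
print(iqr_cleaned.shape)
```

With this made-up data, only the score of 100 falls outside the fences, so nine rows remain. The multiplier 1.5 is a common convention rather than a fixed rule, and a larger `k` makes the filter more permissive.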
Z-score method
This method identifies outliers by how many standard deviations they lie from the mean.
However, the mean and standard deviation are themselves influenced by outliers, which can lead to inaccurate detection when the data is skewed.
In this scenario, a few high scores can inflate the standard deviation, potentially masking other outliers, and can shift the mean, so that valid data points are misidentified as outliers.
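The masking effect can be seen on a small example. The sketch below uses hypothetical scores (invented here, not taken from the dataset) and compares a Z-score threshold of 2 with the conventional 1.5 × IQR fences.

```python
import pandas as pd

# Hypothetical exam scores: six typical values and two unusually high ones
scores = pd.Series([50, 52, 54, 56, 58, 60, 95, 98])

# Z-score rule: the two high scores inflate the standard deviation
z = (scores - scores.mean()) / scores.std()
z_flagged = scores[z.abs() > 2]

# IQR rule with 1.5 * IQR fences, for comparison
q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1
iqr_flagged = scores[(scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)]

print("flagged by Z-score:", z_flagged.tolist())
print("flagged by IQR:", iqr_flagged.tolist())
```

On this data the Z-score rule flags nothing, because 95 and 98 pull the mean up and widen the standard deviation enough to keep their own Z-scores below 2, while the IQR rule flags both values.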
# Outlier treatment
import numpy as np


def remove_outliers_zscore(df, threshold=2):
    """
    Remove outliers from a DataFrame using the Z-score method.

    Parameters:
        df (DataFrame): The input DataFrame.
        threshold (float): The Z-score threshold for identifying outliers.
            Observations whose absolute Z-score exceeds this threshold are
            treated as outliers. The default of 2 keeps values within
            2 standard deviations of the mean (approximately 95% of the
            data when it is roughly normally distributed).

    Returns:
        DataFrame: The DataFrame with outliers removed.
    """
    # Detect the numerical columns so the function does not rely on a global
    numerical_cols = df.select_dtypes(include="number").columns

    # Calculate Z-scores for numerical columns
    z_scores = (df[numerical_cols] - df[numerical_cols].mean()) / df[numerical_cols].std()

    # Identify outliers
    outliers = np.abs(z_scores) > threshold

    # Keep only rows with no outlier in any numerical column
    df_cleaned = df[~outliers.any(axis=1)]
    return df_cleaned


cleaned_df = remove_outliers_zscore(df1)
print(cleaned_df.shape)
def clip_outliers_zscore(df, threshold=2):
    """
    Clip outliers in a DataFrame using the Z-score method.

    Parameters:
        df (DataFrame): The input DataFrame.
        threshold (float): The Z-score threshold for identifying outliers.
            Observations whose absolute Z-score exceeds this threshold are
            treated as outliers.

    Returns:
        DataFrame: The DataFrame with outliers clipped.
    """
    # Detect the numerical columns so the function does not rely on a global
    numerical_cols = df.select_dtypes(include="number").columns

    # Per-column mean and standard deviation define the clipping bounds
    mean = df[numerical_cols].mean()
    std = df[numerical_cols].std()

    # Clip values to [mean - threshold * std, mean + threshold * std]
    clipped_values = df[numerical_cols].clip(
        lower=mean - threshold * std,
        upper=mean + threshold * std,
        axis=1,
    )

    # Assign clipped values to a copy of the original DataFrame
    df_clipped = df.copy()
    df_clipped[numerical_cols] = clipped_values
    return df_clipped


clipped_df = clip_outliers_zscore(df1)
print(clipped_df.shape)