Consider the student exam scores scenario, which contains some outliers.
IQR method
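This method identifies outliers based on the quartiles of the data: values that lie far outside the interquartile range (IQR = Q3 - Q1), conventionally more than 1.5 x IQR below Q1 or above Q3, are flagged.
Because quartiles are barely affected by extreme values, it is more robust than mean-based methods such as the Z-score, which makes it better suited to skewed data such as exam scores, where a few high scores can shift the mean substantially.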
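The IQR rule described above can be sketched as follows. This is a minimal illustration rather than code from the original notebook: the function name `iqr_bounds`, the sample scores, and the multiplier `k = 1.5` are assumptions chosen for the example.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical exam scores with one unusually high value
scores = pd.Series([55, 60, 62, 65, 68, 70, 72, 75, 100])

lower, upper = iqr_bounds(scores)
is_outlier = (scores < lower) | (scores > upper)

print(lower, upper)                  # 47.0 87.0
print(scores[is_outlier].tolist())   # [100]
```

Because the fences depend only on Q1 and Q3, making the extreme score even larger would not move them.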
Z-score method
This method identifies outliers by how many standard deviations they lie from the mean.
However, the mean and standard deviation are themselves sensitive to outliers, which can lead to inaccurate outlier detection when the data is skewed.
In this scenario, a few high scores can inflate the standard deviation, potentially masking other outliers, and can pull the mean upward, so that valid data points on the opposite side may be misidentified as outliers.
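The masking effect can be seen on a small made-up sample. This sketch is not from the original notebook; the scores and the threshold of 2 are illustrative assumptions.

```python
import pandas as pd

# Hypothetical scores: eight ordinary values plus two very high ones
scores = pd.Series([30, 32, 34, 36, 38, 40, 42, 44, 95, 98])

# Z-scores computed from the full sample (pandas uses the sample std, ddof=1)
full_std = scores.std()
z = (scores - scores.mean()) / full_std

# Standard deviation of the eight ordinary scores alone, for comparison
bulk_std = scores[scores < 90].std()

flagged = scores[z.abs() > 2].tolist()

print(round(bulk_std, 2), round(full_std, 2))  # spread before vs. after adding the high scores
print(flagged)  # [] : neither high score exceeds the threshold
```

The two high scores enlarge the standard deviation enough that their own Z-scores stay below 2, so the method flags nothing.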
# Outlier treatment
import numpy as np

# threshold=2: roughly 95% of normally distributed data lies within
# 2 standard deviations of the mean
def remove_outliers_zscore(df, threshold=2):
    """
    Remove outliers from a DataFrame using the Z-score method.

    Parameters:
        df (DataFrame): The input DataFrame.
        threshold (float): The Z-score threshold for identifying outliers.
            Observations whose absolute Z-score is greater than this
            threshold will be considered as outliers.

    Returns:
        DataFrame: The DataFrame with outlier rows removed.
    """
    # numerical_cols (list of numeric column names) is assumed to be defined earlier
    # Calculate Z-scores for numerical columns
    z_scores = (df[numerical_cols] - df[numerical_cols].mean()) / df[numerical_cols].std()
    # Identify outliers
    outliers = np.abs(z_scores) > threshold
    # Keep only rows with no outlier in any numerical column
    df_cleaned = df[~outliers.any(axis=1)]
    return df_cleaned

cleaned_df = remove_outliers_zscore(df1)
print(cleaned_df.shape)
def clip_outliers_zscore(df, threshold=2):
    """
    Clip outliers in a DataFrame using the Z-score method.

    Parameters:
        df (DataFrame): The input DataFrame.
        threshold (float): The Z-score threshold for identifying outliers.
            Values more than this many standard deviations from the
            column mean are clipped to the nearest bound.

    Returns:
        DataFrame: A copy of the DataFrame with outliers clipped.
    """
    # numerical_cols (list of numeric column names) is assumed to be defined earlier
    # Compute the per-column mean and standard deviation once
    mean = df[numerical_cols].mean()
    std = df[numerical_cols].std()
    # Clip each column to [mean - threshold * std, mean + threshold * std]
    clipped_values = df[numerical_cols].clip(lower=mean - threshold * std,
                                             upper=mean + threshold * std,
                                             axis=1)
    # Assign clipped values to a copy of the original DataFrame
    df_clipped = df.copy()
    df_clipped[numerical_cols] = clipped_values
    return df_clipped

clipped_df = clip_outliers_zscore(df1)
print(clipped_df.shape)  # clipping keeps every row, so this matches df1.shape