Snips & Tips

Categorical Encoding

Target Encoding

  • High-cardinality features: A feature with a large number of categories can be troublesome to encode. A one-hot encoding would generate too many features, and alternatives such as label encoding might not be appropriate for that feature. A target encoding derives numbers for the categories using the feature's most important property: its relationship with the target.

  • Domain-motivated features: From prior experience, you might suspect that a categorical feature should be important even if it scored poorly with a feature metric. A target encoding can help reveal a feature's true informativeness.

from sklearn.preprocessing import TargetEncoder

Label Encoding

Can be used for features with 2 categories. With more categories, the integer codes imply an order that may not exist in the data.
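
As a rough illustration, a two-category feature can be label encoded with a plain mapping in pandas. The column name `smoker` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical binary feature.
df = pd.DataFrame({"smoker": ["yes", "no", "no", "yes"]})

# An explicit mapping keeps the meaning of 0 and 1 under our control.
df["smoker_encoded"] = df["smoker"].map({"no": 0, "yes": 1})
```

In scikit-learn, `LabelEncoder` is intended for target labels; `OrdinalEncoder` is the counterpart for feature columns.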

One Hot Encoder

Works well when a feature has more than 2 categories, but produces a wide, sparse matrix when the number of categories is large.
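
A small sketch using `pd.get_dummies`, assuming a hypothetical `color` column with three categories. Each category becomes its own 0/1 column, which is why the width grows with the number of categories.

```python
import pandas as pd

# Hypothetical feature with three categories.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)
```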

Feature Binning

When a feature has a lot of categories (say, more than 20), we can try binning them into 2 or 3 broader groups. For example, different car models can be binned into sports, sedans, and luxury, depending on the problem statement.
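
One way the car model example could look in pandas. The mapping below is an illustrative assumption, not a definitive grouping, and in practice it would come from domain knowledge.

```python
import pandas as pd

# Hypothetical raw feature with many distinct car models.
df = pd.DataFrame({"car_model": ["Mustang", "Camry", "S-Class", "Civic", "Unknown X"]})

# Illustrative mapping from model to a broader segment.
model_to_segment = {
    "Mustang": "sports",
    "Camry": "sedan",
    "Civic": "sedan",
    "S-Class": "luxury",
}

# Models missing from the mapping fall into a catch-all bin.
df["segment"] = df["car_model"].map(model_to_segment).fillna("other")
```

The binned column has few enough categories to be one-hot encoded afterwards.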

Time and Date Transformation

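import numpy as np

# Encode cyclical features as sine/cosine pairs so the ends of the cycle stay adjacent.
# `hr` is assumed to run 0-23 and `mnth` 1-12 (hence the -1 shift).
df['hr_sin'] = np.sin(df.hr*(2.*np.pi/24))
df['hr_cos'] = np.cos(df.hr*(2.*np.pi/24))
df['mnth_sin'] = np.sin((df.mnth-1)*(2.*np.pi/12))
df['mnth_cos'] = np.cos((df.mnth-1)*(2.*np.pi/12))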
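
A short check of why the sine/cosine pair is useful, assuming hours run from 0 to 23. On the raw scale, hour 23 and hour 0 are 23 units apart; on the unit circle they are as close as any two neighbouring hours. The `distance` helper is a hypothetical name introduced for this sketch.

```python
import numpy as np
import pandas as pd

hours = pd.DataFrame({"hr": [23, 0, 1, 12]})
angle = hours["hr"] * (2.0 * np.pi / 24)
hours["hr_sin"] = np.sin(angle)
hours["hr_cos"] = np.cos(angle)

def distance(hr_a: int, hr_b: int) -> float:
    """Euclidean distance between two hours in (sin, cos) space."""
    cols = ["hr_sin", "hr_cos"]
    point_a = hours.loc[hours["hr"] == hr_a, cols].to_numpy()[0]
    point_b = hours.loc[hours["hr"] == hr_b, cols].to_numpy()[0]
    return float(np.linalg.norm(point_a - point_b))

wrap_gap = distance(23, 0)      # across midnight
neighbour_gap = distance(0, 1)  # ordinary adjacent hours
opposite_gap = distance(0, 12)  # opposite sides of the clock
```

Both columns are needed: sine alone maps two different hours to the same value.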

Last updated 2 months ago