Snips & Tips
Snips & Tips
  • Snips & Tips
  • 📊Data Science
    • Polars Dataframe Library
    • Loading large data
    • Pandas
      • Pandas Apply Function
    • Apache Spark
      • Custom Transformer
    • Data Visualizations
    • Jupyter Notebooks
      • Jupyter Notebook Structure
    • Probability
    • Statistics
      • Statistical Tests
      • Z - Test
      • Hypothesis Testing
    • SQL
      • SQL Tips
      • Creating new columns
  • ☘️Deep Learning
    • Backpropagation in Deep Learning
    • Pytorch Early Stopping
    • Optimizers
  • Pytorch Tensor Shapes
  • 🔖Machine Learning
    • Handling Imbalanced Dataset
    • Time Series Forecasting
      • Hierarchical Time Series Forecasting
      • Facebook Prophet
      • Misc
    • Handling high dimensionality data
      • Weight of evidence and Information value
    • Debugging ML Models
    • Feature Engineering
      • Time Series
      • Outlier Detection
      • Categorical Encoding
      • Feature Scaling
  • 🐲DSA
    • Arrays
  • 🖥️WEB DEV
    • Typescript
    • React State Management
    • Redux Boilerplate
    • Intercept a HTTP request or response
    • this keyword
    • Array Methods
    • Throttle Debounce
    • Media Queries
    • React Typeahead Search
  • Replace text with React Component
  • 💻Product Analytics
    • Product Sense
    • Customer Segmentation
  • 🖥️Terminal
    • Terminal Commands
    • Jupyter Notebook 2 HTML
  • 🪛Tools and Libraries
    • Web Based
    • Databases
  • 🚟Backend
    • Fast API CRUD
    • Scalable APIs
  • 💸Quant Finance
    • Misc
    • Factor Investing
  • 🎮Game Dev
    • Misc
  • 🛠️Architecture
    • Docker
    • AWS CDK
  • 🦠Artificial Intelligence
    • AI Engg
Powered by GitBook
On this page

Was this helpful?

Edit on GitHub
  1. Data Science

Pandas

Plotting

To plot histogram from dataframe

df[anything].hist(bins =50)

If you the plot to be separated by categorical variables. for eg final_close_flag has values 0 and 1

df.hist(column="record_number", by="final_close_flag", bins=20)

Box plot

df.boxplot(column='name')

Segregation by features

df.boxplot(column='name', by='feature')

To use plotly as backend to plot directly from pandas dataframe

import pandas as pd
import plotly.graph_objects as go
pd.options.plotting.backend = 'plotly'

df["Close"].plot()

To plot candlestick chart using plotly

fig = go.Figure(data=[go.Candlestick(x=df.index, open=df["Open"],
                high=df["High"], low=df["Low"], close=df["Close"])])
fig.show()

Links

Tips

List missing values of each column

df.apply(lambda x :sum(x.isnull()), axis=0)
df.isnull().sum

Boolean Indexing in Pandas

data.loc[(data["Gender"]=="Female") & (data["Education"]=="Not Graduate") & (data["Loan_Status"]=="Y"), ["Gender","Education","Loan_Status"]]

Cut function for binning

Sometimes numerical values make more sense if clustered together. For example, if we’re trying to model traffic (#cars on road) with time of the day (minutes). The exact minute of an hour might not be that relevant for predicting traffic as compared to actual period of the day like “Morning”, “Afternoon”, “Evening”, “Night”, “Late Night”. Modeling traffic this way will be more intuitive and will avoid overfitting.

Here we define a simple function which can be re-used for binning any variable fairly easily.

#Binning:
def binning(col, cut_points, labels=None):
  #Define min and max values:
  minval = col.min()
  maxval = col.max()

  #create list by adding min and max to cut_points
  break_points = [minval] + cut_points + [maxval]

  #if no labels provided, use default labels 0 ... (n-1)
  if not labels:
    labels = range(len(cut_points)+1)

  #Binning using cut function of pandas
  colBin = pd.cut(col,bins=break_points,labels=labels,include_lowest=True)
  return colBin

#Binning age:
cut_points = [90,140,190]
labels = ["low","medium","high","very high"]
data["LoanAmount_Bin"] = binning(data["LoanAmount"], cut_points, labels)
print (pd.value_counts(data["LoanAmount_Bin"], sort=False))

Coding nominal data using Pandas

Often, we find a case where we’ve to modify the categories of a nominal variable. This can be due to various reasons:

Some algorithms (like Logistic Regression) require all inputs to be numeric. So nominal variables are mostly coded as 0, 1….(n-1) Sometimes a category might be represented in 2 ways. For e.g. temperature might be recorded as “High”, “Medium”, “Low”, “H”, “low”. Here, both “High” and “H” refer to same category. Similarly, in “Low” and “low” there is only a difference of case. But, python would read them as different levels. Some categories might have very low frequencies and its generally a good idea to combine them. Here I’ve defined a generic function which takes in input as a dictionary and codes the values using ‘replace’ function in Pandas.

#Define a generic function using Pandas replace function
def coding(col, codeDict):
  colCoded = pd.Series(col, copy=True)
  for key, value in codeDict.items():
    colCoded.replace(key, value, inplace=True)
  return colCoded

#Coding LoanStatus as Y=1, N=0:
print 'Before Coding:'
print pd.value_counts(data["Loan_Status"])
data["Loan_Status_Coded"] = coding(data["Loan_Status"], {'N':0,'Y':1})
print '\nAfter Coding:'
print pd.value_counts(data["Loan_Status_Coded"])

Show missing values

def show_missing_count(data):
    miss_percent = (data.isnull().sum() / len(data)) * 100
    missing = pd.DataFrame({"Percent":miss_percent, 'Count':data.isnull().sum()}).sort_values(by="Percent", ascending=False)
    missing=missing.loc[missing['Percent'] > 0]
    return missing

Check if there are empty rows or columns and find them

import pandas as pd

df = pd.DataFrame({
  "Accessories": ["Laptop", "Laptop", "Ipad", pd.NA, "Tablet", "Laptop"],
  "customer": ["Andrew", "Andrew", "Tom", "Andrew", "Tobey", pd.NA],
  "quantity": [1, 2, 2, 3, 1, 2],
})

df[df.isnull().any(axis=1)]

Output

index
Accessories
customer
quantity

3

Andrew

3

5

Laptop

2

Here .any() select a row if one of the columns is 'NA' . Internally a numpy mask df.isnull().any(axis=1) is created which outputs following

0    False
1    False
2    False
3     True
4    False
5     True
dtype: bool

List all individual values from categorical columns and their count

df[categorical_columns].melt().groupby(['variable', 'value'])[['value']].count()

Bubble plot

Dataframe

Country
Year
Exports

Country O

2008

28.688524590163933

Country O

2004

25.68807339449541

Country O

1996

5.405405405405405

Country O

2002

17.094017094017094

Country O

1992

18.72791519434629

Plotly Version

import pandas as pd
import plotly.express as px
df = df.sort_values("Country", ascending = False)

fig = px.scatter(
    df, y='Country', x='Year', color='Exports', size='Exports', size_max=20,
    color_continuous_scale = px.colors.sequential.RdBu,
)
fig.update_layout(
        paper_bgcolor="white",
        plot_bgcolor="white",
    )
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='LightGray')
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='LightGray')
fig.update_layout(height=500, width=1000)
fig.update_coloraxes(colorbar=dict(title='Exports'))
fig.update_traces(marker=dict(sizeref=0.09))
fig.update_yaxes(title="Country")
fig.update_xaxes(title='Year')
fig.update_layout(showlegend=False)
fig.show()

SeaBorn Version

plt.figure(figsize=(15, 5))
sns.scatterplot(data=df, x='Year', y='Country', size='Exports', sizes=(20, 200), hue='Exports');
plt.title('Exports by Country and Year')
plt.show()
PreviousLoading large dataNextPandas Apply Function

Last updated 9 months ago

Was this helpful?

📊
https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/
https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
https://www.analyticsvidhya.com/blog/2015/09/hypothesis-testing-explained/