Encoding Categorical Variables in scikit-learn

In machine learning, it is common to encounter datasets with categorical variables, which represent data that can take on a limited set of discrete values. However, most machine learning algorithms are designed to work with numerical data. Therefore, it becomes necessary to convert categorical variables into a numerical representation before applying machine learning techniques.

Fortunately, scikit-learn provides several techniques to encode categorical variables, ensuring compatibility with machine learning algorithms. Let's explore some of these techniques:

1. Label Encoding

Label Encoding assigns a unique integer to each category in a categorical variable. Scikit-learn's LabelEncoder class performs label encoding, assigning integers to categories in alphabetical order. Note that LabelEncoder is intended for encoding target labels (y); for ordinal feature columns, where the order of the categories matters, scikit-learn's OrdinalEncoder is the recommended choice because it lets you specify the category order explicitly.

Here's an example illustrating how to use LabelEncoder:

from sklearn.preprocessing import LabelEncoder

# Create an instance of LabelEncoder
encoder = LabelEncoder()

# Fit the encoder on a categorical variable
encoder.fit(["red", "green", "blue", "red", "green", "green"])

# Transform the categorical variable into numerical labels
encoded_labels = encoder.transform(["red", "green", "blue"])

print(encoded_labels)
# Output: [2 1 0]
# (labels are assigned alphabetically: blue=0, green=1, red=2)
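Because LabelEncoder always orders categories alphabetically, it cannot capture a true ordinal ranking. Here is a minimal sketch using OrdinalEncoder instead, with an illustrative "size" variable whose ordering (small < medium < large) is an assumption for the example:

```python
from sklearn.preprocessing import OrdinalEncoder

# Specify the category order explicitly so the encoding reflects the ranking
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])

# Fit and transform in one step; input is 2D (one column of features)
encoded = encoder.fit_transform([["small"], ["large"], ["medium"]])

print(encoded)
# [[0.]
#  [2.]
#  [1.]]
```

Unlike LabelEncoder, the integers here follow the order you declared, not the alphabet.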

2. One-Hot Encoding

One-Hot Encoding is suitable for nominal variables, where categories have no intrinsic order. It creates binary columns for each unique category, representing the presence or absence of that category in a given data point. Scikit-learn's OneHotEncoder class can be used to perform one-hot encoding.

Let's take a look at an example using OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder

# Create an instance of OneHotEncoder
encoder = OneHotEncoder()

# Fit the encoder on a categorical variable
encoder.fit([["red"], ["green"], ["blue"]])

# Transform the categorical variable into one-hot encoded representation
one_hot_encoded = encoder.transform([["red"], ["green"], ["blue"]])

# Convert sparse matrix representation to a regular array
one_hot_encoded = one_hot_encoded.toarray()

print(one_hot_encoded)
# Output:
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]
# (columns are ordered alphabetically: blue, green, red)
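One practical concern with one-hot encoding is what happens when the model sees a category at prediction time that was not present during fitting. By default OneHotEncoder raises an error, but the handle_unknown parameter changes this. A short sketch:

```python
from sklearn.preprocessing import OneHotEncoder

# handle_unknown="ignore" encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["red"], ["green"], ["blue"]])

# "purple" was never seen during fitting
result = encoder.transform([["purple"]]).toarray()

print(result)
# [[0. 0. 0.]]
```

This is often the safer default in production pipelines, where new category values can appear in live data.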

3. Binary Encoding

Binary Encoding represents each category with a binary code. It is particularly efficient when dealing with high cardinality categorical variables. Scikit-learn does not provide an implementation for binary encoding directly, but libraries like category_encoders can be used.

Here's an example using category_encoders:

import category_encoders as ce
import pandas as pd

# Create a DataFrame with a categorical variable
data = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green', 'green']})

# Create an instance of BinaryEncoder
encoder = ce.BinaryEncoder(cols=['color'])

# Perform binary encoding
binary_encoded = encoder.fit_transform(data)

print(binary_encoded)
# Output (the exact number of columns can vary across category_encoders versions):
#    color_0  color_1
# 0        0        1
# 1        1        0
# 2        1        1
# 3        0        1
# 4        1        0
# 5        1        0

In conclusion, encoding categorical variables is an essential step in preparing data for machine learning models. Scikit-learn itself provides label encoding, ordinal encoding, and one-hot encoding, while third-party libraries such as category_encoders add techniques like binary encoding. The right choice depends on the nature of the variable: whether its categories are ordered or nominal, and how many distinct values it has. Selecting the appropriate technique ensures an accurate and meaningful numerical representation of the data.
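In real datasets, categorical and numerical columns usually live side by side, and the encoders above are typically applied to only some columns. A minimal sketch using scikit-learn's ColumnTransformer; the column names and values here are illustrative assumptions:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# A toy frame mixing a categorical and a numerical column
data = pd.DataFrame({
    "color": ["red", "green", "blue"],
    "price": [10.0, 12.5, 9.0],
})

# One-hot encode only the "color" column; pass "price" through untouched
preprocessor = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), ["color"])],
    remainder="passthrough",
)

transformed = preprocessor.fit_transform(data)
print(transformed)
```

A ColumnTransformer can be dropped into a Pipeline ahead of an estimator, so the same encoding is applied consistently during both training and prediction.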


noob to master © copyleft