Traditional machine learning is used for a wide range of purposes in business today. The process usually involves gathering and storing data, preparing it, and using it to train machine-learning models that make regression, classification, or clustering predictions.
The same data is also used to build complex deep-learning models focused on natural language processing (NLP) or image processing. In every case, however, the accuracy of these models depends on the quality of the data they are trained on.
In this article, we will dive into the concept of feature engineering and explore how it helps improve model performance and accuracy. Feature engineering involves transforming raw data into features that give machine-learning models more valuable signals to learn from. This article will show how to carry out feature engineering concepts using the Python programming language.
Feature engineering is the process of transforming the features in a dataset to expose useful patterns, provide insight, and improve understanding of the data. This, in turn, improves the accuracy of a model trained on that data.
Features are the individual properties or characteristics of a record in a dataset. Engineering them, through manipulation, cleaning, and transformation, provides a better understanding of the dataset as a whole.
Data analysts and data scientists carry out the feature engineering process. It is necessary because raw data is rarely useful for building machine-learning models on its own: it often contains missing values, inconsistencies, or irrelevant information.
Feature engineering plays an important role in traditional machine learning. Its main benefits are the following:
1. Enhanced model performance with well-engineered features: When feature engineering techniques are applied to the features in a dataset, machine-learning models receive reliable data, which enables them to deliver better accuracy and results.
2. Improved data representation and pattern extraction: Properly engineered or transformed features provide reliable and detailed insights into data. This also aids data scientists or analysts in drawing out valuable conclusions from it.
3. Dimensionality reduction and prevention of overfitting: Dimensionality reduction involves removing or filtering out unhelpful or irrelevant features, which in turn yields better model performance, especially on high-dimensional data. It also reduces the chance of the model overfitting.
4. Handling missing data effectively: Feature engineering involves methods in which missing data are handled without harming model performance.
5. Incorporating domain knowledge into the model: Applying feature engineering techniques allows us to include domain knowledge by selecting useful features and removing irrelevant ones from the dataset before training the machine-learning model.
In this section, we will look into some feature engineering techniques in Python, what they do, and their uses.
Data is gathered in a raw format, and much of it is unstructured. Such data often contains missing values, and machine-learning models don't perform well with data containing missing values. There are several ways of handling missing values in a dataset.
Dropping or removing all records containing missing values is one of those ways, but it leads to data loss, which is why it's not advisable. Let's look at other ways to handle missing values that don't carry the risk of data loss.
Let's consider a dataset that contains information about students, including their age, test scores, and grades. We will intentionally introduce some missing values in the dataset to demonstrate how to handle them using different techniques.
Code:
Output:
We used the SimpleImputer from Scikit-learn to fill in missing values in the 'Age' and 'TestScore' columns with their respective mean or median values. We can also use the most frequent value for categorical data. In this example, we used mean imputation for simplicity.
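As a rough illustration, here is a minimal sketch of mean imputation with SimpleImputer on a small, made-up version of the students dataset (the values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical student data with deliberately introduced missing values.
df = pd.DataFrame({
    "Age": [15, 16, np.nan, 18, 17],
    "TestScore": [70.0, np.nan, 85.0, 90.0, np.nan],
    "Grade": ["B", "C", "A", "A", "B"],
})

# Replace missing numeric values with each column's mean.
imputer = SimpleImputer(strategy="mean")
df[["Age", "TestScore"]] = imputer.fit_transform(df[["Age", "TestScore"]])
print(df)
```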
Code:
Output:
We used the fillna() method with the 'ffill' and 'bfill' strategies to propagate the previous or next valid value into the missing entries of the 'Age' and 'TestScore' columns.
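A minimal sketch of forward and backward filling might look like this (hypothetical values; in recent pandas versions, ffill() and bfill() are the preferred spellings of fillna(method='ffill') and fillna(method='bfill')):

```python
import numpy as np
import pandas as pd

# Hypothetical student data with gaps in 'Age' and 'TestScore'.
df = pd.DataFrame({
    "Age": [15, np.nan, 17, np.nan, 19],
    "TestScore": [np.nan, 72.0, np.nan, 88.0, 91.0],
})

# Forward fill copies the previous valid value down the column;
# backward fill pulls the next valid value up.
df_ffill = df.ffill()
df_bfill = df.bfill()
print(df_ffill)
print(df_bfill)
```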
Code:
Output:
We used the interpolate() method to fill in missing values, creating a smooth progression between the existing data points.
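A minimal sketch of linear interpolation, again on hypothetical values, could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical student data with gaps between known values.
df = pd.DataFrame({
    "Age": [15, np.nan, 17, np.nan, 19],
    "TestScore": [60.0, 65.0, np.nan, 75.0, 80.0],
})

# Linear interpolation estimates each missing value from its neighboring points.
df_interpolated = df.interpolate(method="linear")
print(df_interpolated)
```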
Code:
Output:
We used the KNeighborsRegressor from Scikit-learn to predict the missing values based on the k-nearest neighbors of the missing data points in the 'Age' and 'TestScore' columns.
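One way to sketch this, assuming 'Age' is complete and only 'TestScore' has gaps (the values are hypothetical), is to fit the regressor on the rows where the score is known and predict it for the rows where it is missing:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical student data: 'Age' is complete, 'TestScore' has missing values.
df = pd.DataFrame({
    "Age": [15, 16, 17, 18, 19, 20],
    "TestScore": [62.0, 68.0, np.nan, 80.0, np.nan, 90.0],
})

known = df[df["TestScore"].notna()]
missing = df[df["TestScore"].isna()]

# Fit a KNN regressor on rows with known scores...
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(known[["Age"]], known["TestScore"])

# ...and predict the missing scores from the nearest neighbors by age.
df.loc[df["TestScore"].isna(), "TestScore"] = knn.predict(missing[["Age"]])
print(df)
```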
Machine-learning algorithms and models only work with numerical or boolean data, so strings or categorical values must be converted into a numerical format. The conversion is done using encoding techniques.
Let's consider a dataset containing information about fruits, including their type and color. We'll explore three techniques for handling categorical data: One-Hot Encoding, Label Encoding, and Target Encoding.
Code:
Output:
We used the OneHotEncoder from Scikit-learn to convert categorical features into binary vectors, where each category becomes a separate binary column. We dropped the first category to avoid multicollinearity issues. One-hot encoding is suitable when the categorical features do not have a natural order.
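Here is a minimal sketch of one-hot encoding on a small, hypothetical fruits dataset (sparse_output=False requires scikit-learn 1.2 or later):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical fruit data with two categorical columns.
df = pd.DataFrame({
    "Fruit": ["Apple", "Banana", "Cherry", "Apple"],
    "Color": ["Red", "Yellow", "Red", "Green"],
})

# drop='first' removes one binary column per feature to avoid multicollinearity.
encoder = OneHotEncoder(drop="first", sparse_output=False)
encoded = encoder.fit_transform(df[["Fruit", "Color"]])

encoded_df = pd.DataFrame(
    encoded, columns=encoder.get_feature_names_out(["Fruit", "Color"])
)
print(pd.concat([df, encoded_df], axis=1))
```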
Code:
Output:
We used the LabelEncoder from Scikit-learn to transform each category in the 'Fruit' and 'Color' columns into numerical values. Label encoding is useful when the categorical features have an ordinal relationship.
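A minimal sketch of label encoding on the same hypothetical fruits data might look like this:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical fruit data.
df = pd.DataFrame({
    "Fruit": ["Apple", "Banana", "Cherry", "Apple"],
    "Color": ["Red", "Yellow", "Red", "Green"],
})

# Encode each categorical column with its own LabelEncoder.
for col in ["Fruit", "Color"]:
    df[col + "_encoded"] = LabelEncoder().fit_transform(df[col])

print(df)
```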
Code:
Output:
We used the TargetEncoder from the category_encoders library to encode categorical features by replacing each category with the mean of the target variable for that category. Target encoding is helpful when dealing with high-cardinality categorical variables.
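A minimal sketch of target encoding, assuming a hypothetical numeric target such as 'Price', could look like this:

```python
import pandas as pd
import category_encoders as ce  # pip install category_encoders

# Hypothetical fruit data with a numeric target.
df = pd.DataFrame({
    "Fruit": ["Apple", "Banana", "Cherry", "Apple", "Banana"],
    "Color": ["Red", "Yellow", "Red", "Green", "Yellow"],
    "Price": [1.2, 0.5, 2.0, 1.4, 0.6],
})

# Replace each category with a (smoothed) mean of the target for that category.
encoder = ce.TargetEncoder(cols=["Fruit", "Color"])
encoded = encoder.fit_transform(df[["Fruit", "Color"]], df["Price"])
print(encoded)
```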
Feature scaling is a feature engineering method that transforms numerical features so that they fall within a common range of values, usually between 0 and 1. Because the features share the same range, no single feature dominates the others.
Let's consider a dataset that contains information about students, including their age, test scores, and grades. We will demonstrate two feature scaling techniques: Min-Max Scaling (Normalization) and Standardization.
Code:
Output:
We used the MinMaxScaler from Scikit-learn to scale the features to a specified range (usually [0, 1]). This transformation preserves the original distribution of the data and is suitable when the data has a bounded range.
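A minimal sketch of min-max scaling on hypothetical student data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical student data with features on different scales.
df = pd.DataFrame({
    "Age": [15, 16, 17, 18, 19],
    "TestScore": [60, 72, 85, 90, 95],
})

# Rescale each column to the [0, 1] range.
scaler = MinMaxScaler()
df[["Age", "TestScore"]] = scaler.fit_transform(df[["Age", "TestScore"]])
print(df)
```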
Code:
Output:
We used the StandardScaler from Scikit-learn to scale the features to have a mean of 0 and a standard deviation of 1. This technique is useful when the data has outliers or a non-normal distribution.
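And a minimal sketch of standardization on the same kind of hypothetical data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical student data.
df = pd.DataFrame({
    "Age": [15, 16, 17, 18, 19],
    "TestScore": [60, 72, 85, 90, 95],
})

# Transform each column to have mean 0 and standard deviation 1.
scaler = StandardScaler()
df[["Age", "TestScore"]] = scaler.fit_transform(df[["Age", "TestScore"]])
print(df)
```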
Creating polynomial features is another feature engineering method. It involves raising existing features to a power to create new polynomial terms.
Let's consider a dataset containing information about houses, including the area and their corresponding sale prices. We will demonstrate how to create polynomial features to capture non-linear relationships between the house area and sale prices.
Code:
Output:
In this example, we created a sample dataset with the 'Area' of houses and their corresponding 'SalePrice'. We then used PolynomialFeatures from Scikit-learn to create polynomial features to capture the non-linear relationship between the 'Area' and 'SalePrice'. We chose a degree of 2 (quadratic) to create polynomial features up to the square of the 'Area'.
The polynomial features help capture the non-linear patterns in the data, and we then used linear regression to fit a model to these features. The model can now predict the sale prices of houses based on their areas, accounting for the non-linear relationship.
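A minimal sketch of this workflow, using a small, hypothetical housing dataset, might look like this:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical house data: area in square feet and sale price.
df = pd.DataFrame({
    "Area": [800, 1000, 1200, 1500, 1800, 2200],
    "SalePrice": [150000, 180000, 220000, 280000, 350000, 450000],
})

# Expand 'Area' into [Area, Area^2] to capture a non-linear relationship.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[["Area"]])

# Fit a linear regression on the polynomial features.
model = LinearRegression().fit(X_poly, df["SalePrice"])

# Predict the price of a new (hypothetical) 1,600 sq ft house.
new_house = pd.DataFrame({"Area": [1600]})
print(model.predict(poly.transform(new_house)))
```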
Feature selection is a feature engineering technique that keeps only the most relevant or influential features in a dataset. It uses algorithms to determine which features have the most impact on, or the strongest relationship with, the target variable. Training a model with only the relevant features selected can improve the machine-learning model's accuracy.
Let's consider a dataset that contains information about students' performance, including their study hours, test scores, grades, and participation in extracurricular activities. We will demonstrate two feature selection techniques: univariate feature selection and L1 regularization (Lasso).
Code:
Output:
We used SelectKBest from Scikit-learn to select the top k features based on their relevance to the target variable, using the f_regression score function.
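A minimal sketch of univariate selection with SelectKBest, on hypothetical student-performance data, could look like this:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical student-performance data; 'Grade' is the target.
df = pd.DataFrame({
    "StudyHours": [2, 4, 6, 8, 10, 12],
    "TestScore": [55, 60, 68, 75, 85, 92],
    "Extracurricular": [1, 0, 1, 0, 1, 0],
    "Grade": [2.0, 2.4, 2.8, 3.1, 3.6, 3.9],
})

X = df[["StudyHours", "TestScore", "Extracurricular"]]
y = df["Grade"]

# Keep the k features with the highest univariate F-scores.
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, y)
print(X.columns[selector.get_support()].tolist())
```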
Code:
Output:
We used Lasso regression, which applies L1 regularization, to penalize features with low importance by driving their coefficients to zero. We selected the top k features with the highest absolute coefficients.
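A minimal sketch of Lasso-based selection on the same kind of hypothetical data (the alpha value is an arbitrary choice for illustration):

```python
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical student-performance data; 'Grade' is the target.
df = pd.DataFrame({
    "StudyHours": [2, 4, 6, 8, 10, 12],
    "TestScore": [55, 60, 68, 75, 85, 92],
    "Extracurricular": [1, 0, 1, 0, 1, 0],
    "Grade": [2.0, 2.4, 2.8, 3.1, 3.6, 3.9],
})

features = ["StudyHours", "TestScore", "Extracurricular"]
X = StandardScaler().fit_transform(df[features])  # scale so coefficients are comparable
y = df["Grade"]

# L1 regularization drives uninformative coefficients toward zero.
lasso = Lasso(alpha=0.05).fit(X, y)
coefs = pd.Series(lasso.coef_, index=features)

# Rank features by absolute coefficient and keep the top k.
print(coefs.abs().sort_values(ascending=False).head(2))
```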
In this article, we discussed what feature engineering is, the importance of feature engineering in training machine-learning models, and how to implement it using the Python programming language.
Feature engineering is a great skill to acquire as a data scientist or a machine-learning engineer. In addition to the techniques listed in this article, other, more advanced techniques are used when dealing with computer vision, natural language processing (NLP), or time series data.
Ezeana Michael is a data scientist with a passion for machine learning and technical writing. He has worked in the field of data science and has experience working with Python programming to derive insight from data, create machine learning models, and deploy them into production environments.