Python Text Mining: Perform Text Processing, Word Embedding, Text Classification and Machine Translation
- Length: 320 pages
- Edition: 1
- Language: English
- Publisher: BPB Publications
- Publication Date: 2022-03-26
- ISBN-10: 9389898781
- ISBN-13: 9789389898781
- Sales Rank: #1200796
Make use of the most advanced machine learning techniques to perform NLP and feature extraction
Key Features
- Learn about pre-trained models, deep learning, and transfer learning for NLP applications.
- All-in-one knowledge guide for feature engineering, NLP models, and pre-processing techniques.
- Includes use cases, enterprise deployments, and a range of Python-based demonstrations.
Description
Natural Language Processing (NLP) has proven useful in a wide range of applications, and extracting information from text datasets demands careful attention to methods, techniques, and approaches.
‘Python Text Mining’ includes a number of application cases, demonstrations, and approaches that will help you deepen your understanding of feature extraction from datasets. You will gain an understanding of effective information retrieval, a critical step in many machine learning tasks. You will learn to classify text into discrete segments based solely on model properties, not on user-supplied criteria. The book walks you through many methodologies, such as classification, that will enable you to rapidly build recommendation engines, topic segmentation, and sentiment analysis applications. Toward the end, it also covers machine translation and transfer learning.
By the end of this book, you’ll know exactly how to gather web-based text, process it, and then apply it to the development of NLP applications.
What you will learn
- Learn how to process raw data and transform it into a usable format.
- Apply the best techniques for converting text to vectors and then into word embeddings.
- Unleash ML and DL techniques to perform sentiment analysis.
- Build modern recommendation engines using classification techniques.
Who this book is for
With its examples, explanations, and exercises, this book is a good starting point for anyone interested in learning more about advanced text mining and natural language processing techniques. Prior programming experience is suggested but not required.
Table of Contents
Front matter: Cover Page; Title Page; Copyright Page; Dedication Page; About the Author; About the Reviewer; Acknowledgement; Preface; Errata; Table of Contents
1. Basic Text Processing Techniques: Introduction; Structure; Objectives; Data preparation; Project 1: Twitter data analysis (scraping the data; data pre-processing; importing necessary packages; HTML parsing; removing accented characters; expanding contractions; lemmatization and stemming; fail case; removing special characters; removing stop words; handling emojis or emoticons; emoji removal; text acronym abbreviation; Twitter data processing; extracting usertags and hashtags); Project 2: In-shots data pre-processing (importing the necessary packages; setting the URLs for data extraction; function to scrape data from the URLs; importing packages); Conclusion; Questions; Multiple choice questions; Answers
2. Text to Numbers: Introduction; Structure; Objectives; Feature encoding or engineering; One-hot encoding (corpus; code; creating the text corpus; some basic pre-processings; min_df; max_df; limitations); Bag of words (code; performing bag-of-words using sklearn; difference between one-hot encoding and bag of words; limitations); N-gram model (limitations); TF-IDF (code; performing TF-IDF using sklearn); Project 1 (solution; loading the dataset; some basic pre-processings; one-hot encoding; bag of words; bag-of-N-grams model); Project 2 (loading the dataset; some basic pre-processings; TF-IDF); Comparison of one-hot, BOW, and TF-IDF; Conclusion; Questions; Multiple choice questions; Answers
3. Word Embeddings: Introduction; Structure; Objective; Word vectors or word embeddings; Difference between word embeddings and TF-IDF; Feature engineering with word embeddings; Word2Vec (code; t-SNE; word similarity dataframe); Global Vector (GloVe) model (the GloVe model using spaCy; loading the downloaded vector model; word vector dataframe; t-SNE visualization; word similarity dataframe); fastText (fastText using Gensim; t-SNE visualization; finding the odd word out using fastText); Difference between Word2Vec, GloVe, and fastText; Using pre-trained word embeddings (importing necessary libraries; loading the Word2Vec model; sample data initialization; pre-processings and word tokenizations; extracting the list of unique words; t-SNE visualization); Project (solution; importing necessary libraries; loading the Word2Vec model; scraping data from In-shots; pre-processings and word tokenizations; extracting the list of unique words; removing words not in vocab; t-SNE visualization); Conclusion; Project
4. Topic Modeling: Introduction; Structure; Objectives; Topic modeling; Identity matrix; Unitary matrix; Eigenvalues and eigenvectors; Singular value decomposition; Latent semantic indexing (TF-IDF vectorization; building an SVD model; looking at the topics and the words contributing to each topic; advantages and disadvantages of LSI); Latent Dirichlet Allocation (introduction; working; about the data; some pre-processing; looking at the top 20 frequently used words; some EDA; generating bi-grams (BoW); LDA model fitting); LDA using Gensim and its visualization (importing the data; some pre-processing; extending stop words and building n-gram models; creating term-document frequency and the LDA model; dominant topic identification; pyLDAvis; disadvantages of LDA); Non-Negative Matrix Factorization (NMF) (importing necessary libraries; some pre-processing; looking at the top 20 frequently used words; some EDA; generating bi-grams (BoW); building a TF-IDF vectorizer; visualizing ranks with the TF-IDF weights; NMF modelling; disadvantages of NMF); Conclusion; Questions; Answers; Projects
5. Unsupervised Sentiment Classification: Introduction; Structure; Objective; Lexicon-based approach (about the dataset; loading necessary libraries; importing the dataset; some pre-processings; defining a function to perform the following); Opinion lexicon (importing the opinion lexicon; tokenizing the reviews into sentences and forming the sentence and review IDs; sentiment classification; converting the sentiments to review level; converting the sentiment codes from the dataset to sentiments); SentiWordNet lexicon (function to perform SentiWordNet sentiment classification; evaluation); TextBlob (importing libraries; predicting the sentiment of sample reviews; prediction and evaluation); AFINN (importing necessary libraries; sentiment classification and evaluation); VADER (importing necessary libraries; sentiment classification and evaluation; sample prediction); Drawbacks of lexicon-based sentiment classification; Conclusion; Questions; Answers
6. Text Classification Using ML: Introduction; Structure; Objectives; Supervised learning; About the dataset; Loading the necessary libraries; Importing the dataset; Pre-processings; Performing TF-IDF; Model fitting (logistic regression; Lasso regularization; Ridge regularization; elastic-net classifier; Naïve Bayes algorithm; K-Nearest Neighbors; decision tree; random forest; AdaBoost; gradient boosting machine; XGBoost; grid search); Conclusion; Questions; Answers; Project
7. Text Classification Using Deep Learning: Introduction; Structure; Objectives; Learning about neural networks; Neural networks for sentiment classification; Neural networks with TF-IDF (installing libraries; importing libraries; importing the dataset; pre-processings; train, test, and validation sets; performing TF-IDF; model building; linear regression; increasing the dimensionality; activation functions; model fitting; cross-validation); Neural networks with Word2Vec (data splitting; creating a Word2Vec model; Word2Vec model fitting; creating word vectors; padding sequences; ANN model building; model fitting; cross-validation); Sentiment analysis using LSTM (importing the dataset; pre-processings; data splitting and padding; LSTM model building; cross-validation); Comparison of results; Conclusion; Questions; Answers
8. Recommendation Engine: Introduction; Structure; Objective; Applications; Classification of recommendation systems; Simple rule-based recommenders (about the dataset; installing and loading necessary libraries; importing the dataset; building a simple rule-based recommendation system; weighted ratings calculation; applying the calculation on the filtered records); Content-based (using document similarity; about the dataset; installing and loading necessary libraries; importing the dataset; some pre-processing; extracting TF-IDF features; computing pairwise document similarity; building a movie recommender; using word embeddings; fastText; generating document-level embeddings); Collaborative-based (user-based; about the dataset; installing and loading necessary libraries; importing the dataset); Advantages of a recommendation system; Conclusion; Questions; Answers
9. Machine Translation: Introduction; Structure; Objectives; Application; Types of MT; Readily available libraries (TextBlob; LangDetect; fastText); Sequence-to-sequence modeling (about the dataset; installing and loading necessary libraries; importing the dataset; preprocessing; model building using LSTM); Conclusion; Exercise; Questions; Answers
10. Transfer Learning: Introduction; Structure; Objectives; Universal Sentence Encoder (goal; what is a transformer and do we need it?; Deep Averaging Network (DAN); about the data; data pre-processing); Bidirectional Encoder Representations from Transformers (BERT) (what is the necessity of BERT?; the main idea behind BERT; why is BERT so powerful?; BERT architecture; text processing; pre-training tasks; fine-tuning; drawbacks); Conclusion; Multiple choice questions; Answers; Project
Index
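Chapters 5 through 7 center on sentiment classification. As a flavor of the lexicon-based approach covered in Chapter 5, here is a toy sketch; the word sets below are made-up stand-ins for real opinion lexicons such as AFINN or VADER, not the book's code:

```python
# Toy lexicon-based sentiment classifier: count positive vs. negative hits.
POSITIVE = {"good", "great", "excellent", "love", "amazing"}
NEGATIVE = {"bad", "terrible", "poor", "hate", "awful"}

def lexicon_sentiment(review: str) -> str:
    """Label a review by comparing counts of positive and negative lexicon words."""
    tokens = review.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("great movie I love it"))            # positive
print(lexicon_sentiment("terrible plot and awful acting"))   # negative
```

Real lexicons add valence weights, negation handling, and intensity modifiers, which is exactly where the drawbacks discussed in Chapter 5 come in.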