From the course: Deep Learning: Getting Started
Creating text representations
- [Instructor] Let us proceed to create text representations for the spam data. The code for this preprocessing is available in section 5.2 of the notebook. The data for this example is available in the CSV file Spam.Classification.csv in the Exercise Files folder. We load this data into a pandas data frame and print its contents to check it. We then separate the feature and target attributes into separate variables. Let's run this code. As we can see, the spam messages have a lot of special characters and words that need to be cleaned. To perform the required preprocessing, we first create a custom tokenizer function. This function first splits the sentences into tokens using the tokenizer in the nltk library. Then it filters out stopwords. Finally, it lemmatizes the words and returns the lemmatized tokens as an array. We create a TfidfVectorizer model using the custom tokenizer. We fit this model on the spam messages attribute and also transform the messages into a TF-IDF vector. We then convert this vector into a numpy array. The feature variables are now ready for deep learning. For the target variable, we first convert it into numeric values using a label encoder. This encoder provides encoding for two classes. Then we create a one-hot encoding vector using keras.utils. The target variable is now ready. We print the size of the feature and target variables. We then split the dataset into training and test sets. Let's run this code now. The feature variables have 4,566 columns, and the target variable has two. We can now proceed to build the deep learning model.
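To make the steps above concrete, here is a minimal sketch of what the pipeline computes, written with only the Python standard library. It is not the notebook's code: the course uses nltk, TfidfVectorizer, a label encoder, and keras.utils, while this sketch swaps in hand-rolled stand-ins. The function names (tokenize, tfidf_matrix, one_hot_labels), the tiny stop-word list, the three sample messages, and the smoothed idf formula are all assumptions chosen for illustration, and lemmatization is left out entirely.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the course uses nltk's list).
STOP_WORDS = {"a", "an", "the", "is", "to", "you", "for", "at", "are", "we", "now", "your"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words.

    Lemmatization, which the course does with nltk, is omitted here.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF weight of term j in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    # One common smoothed idf variant (an assumption): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[t] * idf[t] for t in vocabulary])
    return vocabulary, rows


def one_hot_labels(labels):
    """Map each label to an integer class index, then to a one-hot row."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    rows = [[1 if index[label] == i else 0 for i in range(len(classes))] for label in labels]
    return classes, rows


# Hypothetical sample data standing in for the CSV file.
messages = [
    "WINNER!! Claim your free prize now",
    "Are we meeting for lunch at noon?",
    "Free entry: claim a prize today",
]
labels = ["spam", "ham", "spam"]

vocabulary, features = tfidf_matrix(messages)
classes, targets = one_hot_labels(labels)

# Sizes of the feature and target variables: rows x columns.
print(len(features), "x", len(vocabulary))
print(len(targets), "x", len(classes))
```

Each message becomes one row with one column per vocabulary term, which is why the real dataset ends up with thousands of feature columns, and each label becomes a two-element row because there are two classes.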