From the course: Applied AI: Getting Started with Hugging Face Transformers


Vectorization


- [Instructor] A key preprocessing step for machine learning is vectorization. Vectorization is a set of techniques for converting text data into equivalent numerical representations that machine learning algorithms can consume. To build meaningful models, vectorization techniques need to retain the content, sequencing, and context of the text. Techniques for vectorization have evolved significantly over the past few years. Initially, the Bag of Words technique was used for vectorization. Here, each unique token in the vocabulary is considered a feature. A feature vector is built for a sentence, with a value of one if the token is present in the sentence and zero otherwise. This technique results in sparse vectors and does not capture information about the context or sequencing of words. This was then improved by the term frequency-inverse document frequency technique…
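
The Bag of Words description above can be made concrete with a small sketch. This is a plain-Python illustration, not code from the course: the helper names `build_vocabulary` and `bag_of_words`, the whitespace tokenization, and the sample sentences are all assumptions made for the example. It builds one feature per unique token and marks each feature with one or zero depending on whether the token appears in the sentence.

```python
def build_vocabulary(sentences):
    """Collect the unique lowercase tokens across all sentences.

    Tokenization here is a simple whitespace split (an assumption for
    illustration); the result is sorted so feature order is stable.
    """
    tokens = set()
    for sentence in sentences:
        tokens.update(sentence.lower().split())
    return sorted(tokens)


def bag_of_words(sentence, vocabulary):
    """Return a binary vector: 1 if the vocabulary token occurs in the sentence, else 0."""
    present = set(sentence.lower().split())
    return [1 if token in present else 0 for token in vocabulary]


# Hypothetical sample sentences, chosen only to illustrate the technique.
sentences = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(sentences)
vectors = [bag_of_words(s, vocabulary) for s in sentences]

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 1, 1, 1, 1]
print(vectors[1])  # [0, 1, 0, 0, 1, 1]

# Word order is lost: these two sentences produce identical vectors.
order_vocab = build_vocabulary(["dog bites man", "man bites dog"])
print(bag_of_words("dog bites man", order_vocab)
      == bag_of_words("man bites dog", order_vocab))  # True
```

With a realistic vocabulary of many thousands of tokens, most entries in each vector would be zero, which is the sparsity the instructor mentions. The last comparison shows why this representation cannot capture sequencing: two sentences with opposite meanings map to the same vector.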
