
How can we teach a machine to understand the meaning behind "love" or "logic" when its world is built on numbers? This fundamental challenge lies at the heart of artificial intelligence and natural language processing. The answer is found in the elegant and powerful field of text representation, which provides the methods to translate the rich complexity of human language into the structured, numerical language of computers. This article bridges the gap between abstract concepts and computational reality, demystifying the process of turning words, documents, and even ideas into vectors that machines can interpret.
This exploration is structured in two parts. First, under "Principles and Mechanisms," we will delve into the mathematical foundation of representation, beginning with the simple yet profound act of vectorizing a matrix and extending this idea to construct modern word embeddings. We will uncover how basic linear algebra operations can reveal a matrix's secrets and how similar principles are used to build a "thought vector" for a document. Following this, the "Applications and Interdisciplinary Connections" chapter broadens our perspective, revealing how representation is a unifying thread that connects mathematics, computer science, and the frontiers of artificial intelligence. You will discover how a single good representation can unlock reasoning, generalization, and a deeper understanding of meaning itself. Our journey begins with the core principles that make this all possible.
In our journey to teach machines to understand language, we must first solve a fundamental problem: how do we translate the rich, nuanced, and often messy world of human text into the rigid, numerical language of computers? A computer does not understand "love" or "logic"; it understands lists of numbers. The entire field of text representation is dedicated to building this bridge, and the principles behind it are as elegant as they are powerful. Our exploration begins not with words, but with a surprisingly simple and beautiful concept from linear algebra: turning a grid of numbers into a single list.
Imagine you have a table of numbers, like a spreadsheet or a digital photograph's pixel data. In mathematics, we call this a matrix. While this two-dimensional grid is intuitive for us, most foundational computational algorithms, especially in machine learning, are designed to work with one-dimensional lists of numbers, or vectors. So, how do we flatten a matrix into a vector?
The most common method is called vectorization. Think of reading a page of a book. You could read the first column from top to bottom, then move to the second column and read it top to bottom, and so on. This is precisely the idea behind column-major vectorization. We take each column of the matrix, in order from left to right, and stack them on top of each other to form one long column vector.
Let's take a general $2 \times 3$ matrix $A$:

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix}$$

Its columns are $(a_{11}, a_{21})^\top$, $(a_{12}, a_{22})^\top$, and $(a_{13}, a_{23})^\top$. Stacking them gives us the vectorized form, denoted $\operatorname{vec}(A)$:

$$\operatorname{vec}(A) = (a_{11}, a_{21}, a_{12}, a_{22}, a_{13}, a_{23})^\top$$

This process is simple, deterministic, and fully reversible. We've lost no information, just rearranged it. Notice that the last element in this new vector is $a_{23}$, which was the element in the last row and last column of the original matrix ($a_{mn}$ for a general $m \times n$ matrix). This mechanical process works for any matrix, no matter its shape.
What about objects that are already vector-like? If we take a single column vector, which is an $m \times 1$ matrix, vectorization does exactly what you'd expect: it leaves it unchanged, since there's only one column to "stack". A row vector, a $1 \times n$ matrix, has a more interesting fate. Each "column" is just a single number, so vectorizing it means stacking these individual numbers into a single column vector. This consistent behavior is part of the mathematical elegance of the operation.
Of course, we could have chosen to read the matrix row-by-row, like reading a paragraph. This is called row-major vectorization. For our matrix $A$, this would produce a different vector by rearranging the components. This distinction is important in practice, as different software libraries may use different conventions, but the underlying principle of flattening a structured grid into a list remains the same. For our discussion, we will stick to the more common column-major convention.
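Both conventions are easy to see with NumPy (used here purely for illustration): `order="F"` (Fortran order) gives column-major flattening, and `order="C"` gives row-major.

```python
import numpy as np

# A small 2x3 matrix to vectorize.
A = np.array([[1, 2, 3],
              [4, 5, 6]])

# Column-major vectorization: stack the columns top to bottom.
vec_col = A.flatten(order="F")   # [1, 4, 2, 5, 3, 6]

# Row-major vectorization: read the rows left to right.
vec_row = A.flatten(order="C")   # [1, 2, 3, 4, 5, 6]

# The operation is fully reversible, provided we reshape
# with the same ordering convention we flattened with.
A_back = vec_col.reshape(2, 3, order="F")

print(vec_col)                    # [1 4 2 5 3 6]
print(vec_row)                    # [1 2 3 4 5 6]
print(np.array_equal(A, A_back))  # True
```

Mixing up the two conventions is a classic source of bugs when moving data between libraries, which is why the reshape above must name the same order as the flatten.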
You might be thinking, "Okay, we've turned a matrix into a vector. So what? Was this just a pointless reshuffling?" The answer is a resounding no. This transformation is profoundly useful because it allows us to use the powerful and well-understood tools of vector algebra to analyze matrices. To see how, we need to recall one of the most fundamental operations in mathematics: the inner product (or dot product).
The inner product of two vectors, written as $\langle u, v \rangle$, essentially measures how much they point in the same direction. It's calculated by multiplying their corresponding components and summing the results. When we apply this to our new vectorized matrices, something remarkable happens.
Consider the inner product of $\operatorname{vec}(A)$ with itself: $\langle \operatorname{vec}(A), \operatorname{vec}(A) \rangle$. This is the sum of the squares of all the components in the vectorized list. But since those components are just the rearranged elements of the original matrix $A$, this value is identical to the sum of the squares of all the elements in the matrix, $\sum_{i,j} a_{ij}^2$. This quantity is so important it has a name: the squared Frobenius norm of the matrix, $\|A\|_F^2$. It measures the matrix's overall "magnitude" or "energy." Vectorization provides a beautiful bridge: the geometric concept of a vector's squared length in the flattened space is numerically identical to the matrix's squared Frobenius norm in its original space.
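A quick numerical check of this identity, sketched in NumPy:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

v = A.flatten(order="F")          # vec(A) = [1, 3, 2, 4]

# Inner product of vec(A) with itself: 1 + 9 + 4 + 16 = 30.
inner = np.dot(v, v)

# The same quantity as the squared Frobenius norm of A.
frob_sq = np.linalg.norm(A, "fro") ** 2

print(inner, frob_sq)             # both are 30 (up to floating point)
```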
Now for a bit of magic. What if we vectorize the identity matrix $I$ (a matrix with 1s on the diagonal and 0s everywhere else), and take its inner product with the vectorization of a square matrix? Let's try it for a $2 \times 2$ matrix $B$:

$$I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$$

Their vectorizations are:

$$\operatorname{vec}(I) = (1, 0, 0, 1)^\top, \qquad \operatorname{vec}(B) = (b_{11}, b_{21}, b_{12}, b_{22})^\top$$

The inner product becomes:

$$\langle \operatorname{vec}(I), \operatorname{vec}(B) \rangle = 1 \cdot b_{11} + 0 \cdot b_{21} + 0 \cdot b_{12} + 1 \cdot b_{22} = b_{11} + b_{22}$$

This is the trace of the matrix $B$—the sum of its diagonal elements! The vector $\operatorname{vec}(I)$ acts as a "selector" or a "mask," with its 1s perfectly positioned to pick out the diagonal elements from the long list and ignore everything else. This isn't a coincidence; $\operatorname{vec}(I_n)$ gives a unique selector vector of 0s and 1s that computes the trace for a square matrix of any size. This demonstrates that a matrix-specific operation can be elegantly rephrased as a standard inner product in the vector world.
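Here is the selector trick sketched for a $3 \times 3$ matrix, again assuming NumPy:

```python
import numpy as np

B = np.array([[5.0, 1.0, 2.0],
              [0.0, 7.0, 3.0],
              [4.0, 6.0, 9.0]])

# vec(I) is a 0/1 "selector": its 1s line up exactly with B's diagonal
# entries in the flattened list, whatever the (square) size.
selector = np.eye(3).flatten(order="F")

trace_via_inner = np.dot(selector, B.flatten(order="F"))

print(trace_via_inner)   # 21.0, i.e. 5 + 7 + 9
print(np.trace(B))       # 21.0 again
```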
This is all fascinating for manipulating grids of numbers, but what does it have to do with understanding a sentence like "The cat sat on the mat"? The leap of insight is to realize that we can create numerical representations of words and documents, and then apply these same mathematical principles.
The first, simplest attempt is the bag-of-words model. Imagine you have a dictionary of every word in a language, say, 50,000 words. We can represent any document as a 50,000-dimensional vector. For the sentence "The cat sat," we would put a '1' in the position corresponding to "the," a '1' for "cat," and a '1' for "sat." All other 49,997 entries would be '0'. If a word appears twice, we might put a '2'. This gives us a count vector—a simple, numerical fingerprint of the document.
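A minimal sketch of such a count vector, using a toy six-word vocabulary in place of the full 50,000-word dictionary:

```python
from collections import Counter

# Toy vocabulary standing in for the full dictionary.
vocab = ["the", "cat", "sat", "on", "mat", "dog"]

def count_vector(text):
    """Bag-of-words count vector of `text` over `vocab`."""
    counts = Counter(text.lower().split())
    return [counts.get(word, 0) for word in vocab]

print(count_vector("The cat sat on the mat"))  # [2, 1, 1, 1, 1, 0]
```

Note that any permutation of the words produces the same vector, which is exactly the loss of word order discussed next.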
However, this approach is quite primitive. It's a huge, sparse vector (mostly zeros), and it loses all sense of word order. To this model, "Man bites dog" and "Dog bites man" are indistinguishable. More importantly, it has no notion of meaning. It doesn't know that "cat" is more similar to "dog" than it is to "car."
To solve this, we move to a much richer idea: word embeddings. Instead of just a '1' to signal a word's presence, we assign each word its very own, dense vector—typically with a few hundred dimensions. This vector isn't arbitrary; it's learned from vast amounts of text data. In this high-dimensional "semantic space," vectors for words with similar meanings point in similar directions. This leads to the famous analogy: vector('king') - vector('man') + vector('woman') results in a vector very close to vector('queen'). The spatial relationships between these vectors capture semantic relationships. The collection of all these word vectors can be organized into a single large matrix, our embedding matrix $E$, where each row is the vector for a specific word.
Now we can bring everything together. We have a bag-of-words count vector for a document, and an embedding matrix that knows the meaning of each word. How do we produce a single vector that represents the meaning of the entire document?
We perform a weighted sum. The representation of our document, let's call it $d$, is calculated as $d = E^\top c$, where $c$ is the document's count vector. Let's unpack this. This operation iterates through our vocabulary. For each word, it takes its embedding vector (a row from $E$) and multiplies it by the number of times it appeared in our document (the corresponding count from $c$). Finally, it adds all these scaled vectors together. A document containing "cat cat dog" becomes represented by 2 * vector('cat') + 1 * vector('dog'). The final document vector is a "thought vector"—a blend of the meanings of its constituent words, weighted by their frequency.
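As a sketch, with a toy three-word vocabulary and made-up four-dimensional embeddings (real embeddings are learned from data, these numbers are purely illustrative):

```python
import numpy as np

# Toy embedding matrix E: one row per vocabulary word.
E = np.array([[0.9, 0.1, 0.0, 0.3],   # "cat"
              [0.8, 0.2, 0.1, 0.3],   # "dog"
              [0.0, 0.0, 0.9, 0.1]])  # "mat"

# Count vector for the document "cat cat dog" over this vocabulary.
c = np.array([2, 1, 0])

# Document vector: the weighted sum of word embeddings.
d = E.T @ c

print(d)  # [2.6 0.4 0.1 0.9] = 2*vector('cat') + 1*vector('dog')
```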
This simple-looking linear operation is the engine behind many modern text analysis models, and its properties are revealing.
This journey—from the simple, mechanical act of vectorizing a matrix to the sophisticated, meaning-driven construction of a document embedding—reveals a unified theme. The goal is always to transform complex, high-dimensional information, be it a matrix or a string of text, into a single vector in a carefully designed space where mathematical operations reveal hidden structures and relationships. This is the foundational principle that allows us to turn the art of language into a science that machines can begin to understand.
We’ve seen the principle: turning text into numbers, symbols into vectors. At first glance, this might seem like a mere technical necessity for feeding information to a computer. But this act of translation, of finding a mathematical representation for abstract concepts, is far from a simple clerical task. It is a gateway. By representing things as vectors, we place them into a geometric landscape—a “concept space”—where we can measure their distance, find their direction, and see their relationships in a way we never could before. This single idea creates a powerful bridge connecting the deepest questions of mathematics, the fundamental limits of computation, and the frontier of artificial intelligence. Let us embark on a journey through these connections and see the world through the eyes of representation.
Our journey begins with the most familiar representation of all: writing down a number. When we write “123”, we are creating a text representation of a quantity. The rules for this representation are so ingrained that we forget how remarkable they are. They are algorithmic. To convert any integer into a string of digits in a chosen base, we can repeatedly apply the Division Algorithm, extracting digits one by one. This simple, elegant procedure is a cornerstone of how computers handle numbers, a perfect marriage of number theory and symbolic manipulation.
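The digit-extraction procedure takes only a few lines; the helper name `to_base` here is our own choice:

```python
def to_base(n, base):
    """Digits of a non-negative integer n in the given base,
    obtained by repeatedly applying the Division Algorithm."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, r = divmod(n, base)               # n = q*base + r, with 0 <= r < base
        digits.append("0123456789ABCDEF"[r])  # remainders are the digits, low to high
    return "".join(reversed(digits))

print(to_base(123, 10))  # "123"
print(to_base(123, 2))   # "1111011"
print(to_base(255, 16))  # "FF"
```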
But a choice of representation is never neutral; it comes with its own peculiar properties. Consider the integers $1$ through $10$. If we ask a computer to sort these not by their value, but by their string representation as you would in a dictionary, it will tell you the order is $1, 10, 2, 3, 4, 5, 6, 7, 8, 9$. Why? Because the string "10" comes before "2" in lexicographical order. This simple example reveals a profound truth: the properties of the representation are not the same as the properties of the thing being represented. The symbolic world has its own rules, and we must be careful not to confuse them for the rules of the underlying reality.
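The effect is easy to reproduce:

```python
# Sorting the integers 1..10 by value versus by their string representation.
nums = list(range(1, 11))

by_value = sorted(nums)
by_string = sorted(nums, key=str)  # dictionary (lexicographic) order

print(by_value)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(by_string)  # [1, 10, 2, 3, 4, 5, 6, 7, 8, 9]
```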
This act of re-interpretation—treating a representation in a new way—can lead to astonishing insights. In the 19th century, Georg Cantor shocked the mathematical world by showing that some infinite sets are “bigger” than others. He proved that the set of rational numbers (all fractions) is “countable,” meaning it’s the same size as the set of natural numbers $\mathbb{N}$. How can this be proven? One beautiful way uses text representation. We can write any rational number in a standard string format, like “-27/11”. This string is made from a small alphabet of 12 characters: the digits '0'-'9', a minus sign '-', and a slash '/'. If we assign a unique nonzero value to each of these characters (say, 1 through 12), we can interpret the entire string not as a description of a fraction, but as a single, unique integer in base 13! This creates a one-to-one mapping from every possible rational number to a natural number, proving their countability with a trick borrowed from the heart of computer science.
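One concrete way to realize this encoding (the particular character-to-digit assignment is our choice; using only nonzero digits keeps the map one-to-one):

```python
# Map each character of a rational's string form to a nonzero base-13 digit.
ALPHABET = "0123456789-/"
CHAR_CODE = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # codes 1..12

def encode_rational(s):
    """Read the string s (e.g. '-27/11') as a single integer in base 13."""
    n = 0
    for ch in s:
        n = n * 13 + CHAR_CODE[ch]
    return n

print(encode_rational("1"))       # 2: the single character '1' has code 2
print(encode_rational("-27/11"))  # a unique natural number for this fraction

# Distinct strings always receive distinct codes.
assert encode_rational("2/3") != encode_rational("23")
```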
The concept of representation is just as central to the theory of computation itself, which grapples with the difficulty of solving problems. Here, we often represent not just data, but entire problems as strings. A famous example is the reduction of the Boolean Satisfiability problem (SAT) to the Clique problem. This is a formal procedure that translates any given logic formula (a string) into a specific graph (another string-based representation). If you can find a clique of a certain size in the graph, you have solved the original logic problem. But can this translation be done cleverly, perhaps in "sublinear" time—less time than the size of the input formula itself? The theory tells us no. At a minimum, any algorithm that produces an output must take the time to write it down. A representation, whether of a number or a complex computational problem, has a physical size, a cost in ink or bits, and this sets a fundamental speed limit on any process that creates it.
Having explored the precise, logical world of mathematics, we now turn to something far messier: human language, sight, and sound. Can we find a vector for "joy," or a geometric point for the idea of a "cat"? This is the central quest of modern artificial intelligence.
The journey begins by representing simple, qualitative categories. Suppose we want a model to understand the difference between three categories: 'A', 'B', and 'C'. A naive approach is to use "dummy variables," essentially giving each category its own private dimension in a vector space. This is clean and simple, but it's also rigid. The model has no inherent way to know that 'A' might be more similar to 'B' than to 'C'. To capture these richer relationships, we can learn "embeddings"—dense vectors in a lower-dimensional space where the geometry itself carries meaning. However, this power comes at a cost. With too little data, a model trying to learn these complex embeddings might overfit, hallucinating relationships that aren't really there. The choice of representation becomes a delicate dance, a trade-off between the model's expressive power and its vulnerability to noise—the classic bias-variance tradeoff of statistics.
Furthermore, the very coordinates we choose for this "meaning space" can be deceiving. Just as we saw with lexicographical sorting, the representation has its own quirks. In a linear model, we can represent our categories with different encoding schemes. These schemes will produce wildly different numerical coefficients in the model, suggesting different "effects" for each category. Yet, miraculously, they all produce the exact same final predictions. The underlying geometric projection—the model's core understanding—is invariant. The coefficients are just shadows cast on a particular choice of coordinate axes. It is a crucial lesson in distinguishing what is fundamental from what is an artifact of our description.
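This invariance can be checked directly with least squares: two encodings of the same three categories yield very different coefficients yet identical fitted predictions. A sketch with synthetic data (the group means 1, 2, 4 and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
labels = np.array([0, 1, 2] * 10)   # categories 'A', 'B', 'C'
y = np.array([1.0, 2.0, 4.0])[labels] + rng.normal(0, 0.1, labels.size)

# Encoding 1: one-hot dummies, no intercept.
X1 = np.eye(3)[labels]

# Encoding 2: intercept plus dummies for 'B' and 'C' ('A' is the baseline).
X2 = np.column_stack([np.ones(labels.size),
                      labels == 1,
                      labels == 2]).astype(float)

beta1, *_ = np.linalg.lstsq(X1, y, rcond=None)
beta2, *_ = np.linalg.lstsq(X2, y, rcond=None)

print(beta1)  # per-category means
print(beta2)  # baseline plus offsets: very different numbers...
print(np.allclose(X1 @ beta1, X2 @ beta2))  # ...but identical predictions: True
```

The two design matrices span the same column space, so the fitted projection of `y` is the same; only its coordinates differ.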
The true power of these "semantic spaces" comes when we embrace their geometric nature. What does the space between two concepts mean? If we have a vector for an image of a cat and a vector for an image of a dog, what does the point halfway along the line segment connecting them represent? Data augmentation techniques like Mixup are built on the bold assumption that this interpolated vector corresponds to an interpolated concept. For this to work in a multimodal setting—say, mixing an image of a cat with an image of a dog, and simultaneously mixing their text descriptions—requires a profound property: the geometry of the image space and the text space must be aligned. The path from "cat" to "dog" must have a similar shape in both modalities. When this holds, we have created a smooth, continuous space of meaning, where we can navigate and explore concepts as if they were places on a map.
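The Mixup operation itself is a one-line interpolation; here is a toy sketch with hand-made vectors and one-hot labels (real inputs would be images or embeddings):

```python
import numpy as np

def mixup(x1, x2, y1, y2, lam):
    """Linearly interpolate a pair of examples and their labels."""
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Toy "image" vectors and one-hot labels for 'cat' and 'dog'.
x_cat, y_cat = np.array([1.0, 0.0, 0.5]), np.array([1.0, 0.0])
x_dog, y_dog = np.array([0.0, 1.0, 0.5]), np.array([0.0, 1.0])

# The halfway point on the segment between the two examples.
x_mix, y_mix = mixup(x_cat, x_dog, y_cat, y_dog, lam=0.5)
print(x_mix, y_mix)  # [0.5 0.5 0.5] [0.5 0.5]
```

In the multimodal setting described above, the same `lam` would be applied to both the image pair and the text pair, which is exactly why the two geometries must be aligned.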
The history of AI in the last decade is a story of building ever more sophisticated maps of this meaning-space. Early methods used static, pre-trained word embeddings like GloVe, which assigned a single vector to each word. This was a huge leap, but it couldn't distinguish "interest rate" from "human interest." The breakthrough came with contextual models like BERT, which use the Transformer architecture. These models don't have a single vector for a word; they generate a new one on the fly based on the surrounding sentence. This gives them an unprecedented grasp of nuance. When faced with a practical task, such as classifying financial news, the choice of representation is paramount. Training embeddings from scratch on a small dataset is often futile. Using general-purpose embeddings might miss domain-specific jargon. The most effective strategy is often to take a powerful, pre-trained contextual model and adapt it carefully, leveraging its vast knowledge while avoiding overfitting on the new, smaller task.
How do these giant models manage to encode so much nuance into a fixed-size vector? A look "under the hood" reveals a fascinating phenomenon being explored by researchers: superposition. A Transformer layer, despite its complexity, has a dimensional bottleneck. It simply does not have enough dimensions to assign a unique direction to every possible feature of language. So, it learns to pack them in an overlapping way, with features represented not by single dimensions but by patterns across many. The model's multiple "attention heads" can be seen as specialists that learn to navigate this dense, superposed information, focusing on different feature combinations as needed. This reveals that the model's internal representation of language is an incredibly rich, high-dimensional tapestry that we are only just beginning to understand.
The reward for building these powerful, geometrically coherent representations is a kind of magic. We can build systems that perform "zero-shot" learning—classifying an audio clip or an image into a category it has never been trained on before. This is done by simply comparing the input's vector with the vector for the text description of the unseen category. If the space is organized meaningfully, an audio clip of "rain" will land closer to the text vector for "rain" than for "dog bark." We can even go a step further and explore compositionality. By adding the vector for "speech" to the vector for "music," we can create a prototype for the concept of "speech over music" and use it to find corresponding examples. This ability to reason, generalize, and compose is the holy grail of intelligence, and it is unlocked by the power of a good representation.
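A toy sketch of zero-shot matching by cosine similarity—the vectors here are hand-made stand-ins for what trained audio and text encoders sharing one space would produce:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical text vectors for two candidate labels.
text_vecs = {
    "rain":     np.array([0.9, 0.1, 0.0]),
    "dog bark": np.array([0.1, 0.9, 0.1]),
}

# Pretend embedding of an audio clip of rain, from the same shared space.
audio_clip = np.array([0.8, 0.2, 0.1])

# Zero-shot classification: pick the label whose text vector is closest.
label = max(text_vecs, key=lambda name: cosine(audio_clip, text_vecs[name]))
print(label)  # rain
```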
From the simple act of writing down a number to a machine that can recognize concepts it has never seen, the thread that connects them is the idea of representation. By turning the world into vectors, we do more than just make it computable. We give it a shape, a geometry, and in doing so, we begin to uncover the hidden structure of meaning itself.