Stemming in Data Mining
Stemming is the process of reducing a word to its base or root form, known as the "stem." It removes suffixes and prefixes to get at the core of a word. This is important in Natural Language Understanding (NLU) and Natural Language Processing (NLP), which help computers understand human language.
Stemming is also used in linguistic studies, artificial intelligence (AI), and information retrieval (like search engines). It helps find relevant information in large data sources, like the internet, by searching for different forms of a word related to a topic.
When a new word is encountered, stemming can help discover its base form and reveal new research opportunities. Often, using the base form of a word gives the best results. To find it, stemming can be done by a person or by an algorithm in an AI system.
Simple stemming algorithms are easy to build, but they are prone to errors. For example, they might incorrectly reduce "laziness" to "lazi" instead of "lazy." They also struggle with words whose inflected forms don't closely resemble the base word.
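As an illustration, here is a minimal, naive suffix-stripping stemmer in Python; the rule list and the naive_stem function are invented for this sketch, and it reproduces the "laziness" → "lazi" error described above.

```python
# A deliberately naive suffix-stripping stemmer (illustrative only).
SUFFIXES = ["ness", "ing", "ed", "es", "s"]

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, with no spelling repair afterwards."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("laziness"))  # -> 'lazi', not the intended 'lazy'
print(naive_stem("running"))   # -> 'runn', the doubled consonant is not repaired
```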
History of Stemming
The first published stemmer was created by Julie Beth Lovins in 1968. Her work was a big milestone and influenced later efforts in the field of stemming. She mentioned three important early attempts at stemming algorithms:
- One by Professor John W. Tukey from Princeton University.
- Another by Michael Lesk from Harvard University, supervised by Professor Gerard Salton.
- A third by James L. Dolby from R&D Consultants in California.
Later, Martin Porter developed a stemmer that was published in the journal Program in July 1980. This stemmer became widely used and became the standard for English stemming. In 2000, Dr. Porter received the Tony Kent Strix prize for his contributions to stemming and information retrieval.
Why Stemming Is Important
The English language has many different forms of the same word. When these variations appear in a text, they can cause unnecessary repetition, which can make NLP or machine learning models less effective.
To build a strong model, it's important to normalize the text by removing these repetitions and changing words to their base form using stemming. This helps make the model more efficient and accurate.
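In practice, this normalization step is often done with an off-the-shelf stemmer. The snippet below is a minimal sketch using NLTK's PorterStemmer (it assumes the nltk package is installed; the sample sentence is made up).

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

text = "The players were playing games and played happily while they play"
# Simple lowercasing and whitespace tokenization keep the example self-contained.
tokens = text.lower().split()
normalized = [stemmer.stem(token) for token in tokens]

print(normalized)
# Variants such as "playing", "played", and "play" collapse to the same stem,
# reducing repetition before the tokens reach an NLP or machine learning model.
```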
Errors in Stemming
There are two main errors in stemming:
- Over-stemming: This happens when two different words are incorrectly reduced to the same root. It’s like mistakenly thinking two words are the same when they aren’t. This is a false positive.
- Under-stemming: This happens when two words that should be reduced to the same root are not. It’s like failing to recognize that two words are actually related. This is a false negative.
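A widely used stemmer like Porter shows both kinds of error; the sketch below illustrates them (it assumes nltk is installed, and exact outputs can vary slightly between implementations).

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Over-stemming (false positive): unrelated words collapse to the same stem.
for word in ["university", "universe", "universal"]:
    print(word, "->", stemmer.stem(word))   # all typically map to "univers"

# Under-stemming (false negative): related words keep different stems.
for word in ["alumnus", "alumni"]:
    print(word, "->", stemmer.stem(word))   # typically "alumnu" vs "alumni"
```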
- Truncating or Affix Removal Methods
These methods involve removing parts of a word, such as prefixes or suffixes. The simplest form of this approach is the Truncate(n) stemmer, which cuts a word down to a specific number of letters (n). Words that are already shorter than n are left unchanged. This method can lead to over-stemming when n is small relative to the word's length. A basic example is the S-stemmer, which converts plural forms of nouns into their singular form by removing suffixes.
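A minimal sketch of both ideas in Python (the function names are illustrative, and the S-stemmer rules below are a simplified version of the usual scheme):

```python
def truncate_stem(word: str, n: int) -> str:
    """Truncate(n) stemmer: keep only the first n letters of the word."""
    return word if len(word) <= n else word[:n]

def s_stemmer(word: str) -> str:
    """Simplified S-stemmer: convert common plural endings to singular."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"        # "ponies" -> "pony"
    if word.endswith("es") and len(word) > 3:
        return word[:-2]              # "boxes" -> "box"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]              # "cats" -> "cat"
    return word

print(truncate_stem("computation", 5))                           # -> "compu"
print(s_stemmer("ponies"), s_stemmer("boxes"), s_stemmer("cats"))
```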
Lovins Stemmer
The Lovins stemmer, developed in 1968, is more comprehensive than the Porter stemmer because it uses a large list of suffixes. It works by removing the longest suffix from a word, then adjusting the resulting stem to ensure it's a valid word. One of the benefits of the Lovins stemmer is that it's faster than other methods, even though it has a large suffix list. The Lovins stemmer has 294 suffixes, 29 conditions, and 35 transformation rules. In the first step, it finds the longest suffix that matches one of the conditions and removes it. Then, in the second step, the transformation rules are applied to refine the word, whether or not a suffix was removed in the first step.
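The full Lovins tables are too large to reproduce here, but the two-step, longest-match idea can be sketched with tiny placeholder tables (the endings and recodings below are illustrative, not the real 294-ending and 35-rule lists):

```python
# Illustrative stand-ins for the Lovins tables.
ENDINGS = ["ational", "ation", "ally", "ness", "ing", "es", "er", "s"]
RECODINGS = {"iev": "ief", "uct": "uc", "umpt": "um"}  # a few sample recodings
MIN_STEM_LEN = 2  # the remaining stem must keep a minimum length

def lovins_like_stem(word: str) -> str:
    # Step 1: remove the longest matching ending whose condition is satisfied
    # (here the only condition checked is the minimum stem length).
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= MIN_STEM_LEN:
            word = word[: -len(ending)]
            break
    # Step 2: apply transformation (recoding) rules to tidy up the stem,
    # whether or not an ending was removed in step 1.
    for old, new in RECODINGS.items():
        if word.endswith(old):
            word = word[: -len(old)] + new
            break
    return word

print(lovins_like_stem("nationally"))  # -> "nation" with the toy tables above
print(lovins_like_stem("believes"))    # -> "belief" via the "iev" -> "ief" recoding
```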
- Statistical Methods
These stemmers use statistical techniques to analyze and remove affixes from words. They typically apply statistical procedures to decide which parts of the word to remove.
N-Gram Stemmer
The N-gram stemmer is an interesting and language-independent method. It uses a string-similarity approach to find the stem of a word. An "n-gram" is a sequence of n consecutive characters taken from a word or text. For example, if n equals 2, it creates pairs of adjacent characters (called digrams), or for n equals 3, it creates triplets (called trigrams). The basic idea is that words with similar meanings will share many of the same n-grams. Some stemmers use this concept to choose the correct root form of a word by comparing these n-grams.
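A common way to compare two words under this approach is the Dice similarity coefficient over their sets of n-grams; the sketch below shows the idea for digrams (n = 2), using made-up example words.

```python
def ngrams(word: str, n: int = 2) -> set:
    """Return the set of character n-grams of a word (digrams for n = 2)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient: 2 * |shared n-grams| / (total n-grams in both words)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

# Words that share a root also share many of their n-grams.
print(dice_similarity("astronomy", "astronomical"))  # relatively high (~0.74)
print(dice_similarity("astronomy", "philosophy"))    # 0.0 here, no shared digrams
```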
- Inflectional and Derivational Methods
These methods aim to handle the morphology of words, accounting for both syntactic variations (like plural forms, case, or gender) and changes in part-of-speech (POS). They analyze the underlying linguistic structure of words to reduce them to their base or root form.
The Krovetz Stemmer (KSTEM), introduced by Robert Krovetz in 1993, is one such approach. It is a linguistic lexical validation stemmer, primarily designed to handle the inflectional nature of words. This stemmer is more sophisticated compared to simpler, rule-based ones, as it uses language syntax and inflectional properties to identify and strip away inflectional suffixes. Krovetz’s stemmer operates in a series of steps, typically three, to remove suffixes and return words to their base form, aiming for a higher level of accuracy in handling word variants.
However, due to the complexity of this method, it requires a large corpus to be effective and is part of corpus-based stemmers. Its performance and accuracy are enhanced when used on a broad and diverse set of linguistic data.
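KSTEM itself is not reproduced here, but its core idea, stripping an inflectional ending and then validating the candidate against a lexicon before accepting it, can be sketched as follows; the tiny lexicon and the rule set are placeholders.

```python
# A toy lexicon standing in for the large dictionary that KSTEM relies on.
LEXICON = {"policy", "police", "carry", "walk", "dog", "index"}

def krovetz_like_stem(word: str) -> str:
    """Strip an inflectional ending only if the result is a known word."""
    candidates = []
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")   # plural:     "policies" -> "policy"
    if word.endswith("ied"):
        candidates.append(word[:-3] + "y")   # past tense: "carried"  -> "carry"
    if word.endswith("es"):
        candidates.append(word[:-2])
    if word.endswith("s"):
        candidates.append(word[:-1])
    if word.endswith("ed"):
        candidates.append(word[:-2])
    if word.endswith("ing"):
        candidates.append(word[:-3])
    for candidate in candidates:
        if candidate in LEXICON:             # the lexical validation step
            return candidate
    return word                              # otherwise leave the word unchanged

print(krovetz_like_stem("policies"))  # -> "policy" (validated by the lexicon)
print(krovetz_like_stem("walking"))   # -> "walk"
print(krovetz_like_stem("news"))      # -> "news" ("new" is not in the toy lexicon)
```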
Xerox Inflectional and Derivational Analyzer
The Xerox Inflectional and Derivational Analyzer is part of the set of linguistic tools developed by Xerox’s linguistics group, designed to assist with English and other languages in tasks like information retrieval. This analyzer involves two core databases:
- Inflectional Database: This database reduces surface forms of words to their base forms (also known as lexemes). Examples of how it works (see the sketch after this list):
- Nouns: The plural form is reduced to the singular (e.g., "children" to "child").
- Verbs: Any past tense or conjugated form is reduced to the infinitive (e.g., "understood" to "understand").
- Adjectives: Irregular forms are reduced to their positive degree (e.g., "best" to "good").
- Pronouns: Irregular forms are converted to the nominative form (e.g., "whom" to "who").
- Derivational Database: This database reduces surface forms to stems that are related both in form and meaning to the original word. For instance:
- "government" becomes "govern," but "department" is not reduced to "depart" because of semantic differences.
- Suffixes and prefixes are removed based on specific patterns, such as:
- Suffixes: "-ly", "-ness", "-ic", "-al", "-ation", "-ment", etc.
- Prefixes: "anti-", "dis-", "in-", "pre-", "re-", etc.
- This process typically uses suffix removal, as opposed to more complex prefix removal.
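A tiny sketch of the inflectional side of such an analyzer: irregular forms are handled by a lookup table and regular forms by a crude rule (the table and the rule below are illustrative placeholders, not the actual Xerox databases).

```python
# Illustrative stand-in for an inflectional database: irregular forms first.
IRREGULAR = {
    "children": "child",         # noun: plural reduced to singular
    "understood": "understand",  # verb: past tense reduced to the infinitive
    "best": "good",              # adjective: reduced to the positive degree
    "whom": "who",               # pronoun: reduced to the nominative form
}

def inflectional_lookup(word: str) -> str:
    """Reduce a surface form to its base form (lexeme)."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]         # a crude regular-plural rule for this sketch
    return word

for w in ["children", "understood", "best", "whom", "tables"]:
    print(w, "->", inflectional_lookup(w))
```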
Corpus-Based Stemmer
Corpus-based stemming utilizes statistical methods and the co-occurrence of word variants within a corpus to modify conflation classes. By doing this, it seeks to improve on some limitations found in the Porter Stemmer. The basic idea is that words with the same stem, but different meanings (e.g., "policy" and "police"), should not be conflated. Conversely, words like "index" and "indices" (which share the same root) should be conflated.
The process generally works as follows:
- The Porter Stemmer is used initially to identify word stems.
- Corpus statistics are then used to redefine the conflation rules. Statistical measures (like co-occurrence) help in determining when and how words should be conflated based on their usage in the corpus.
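A rough sketch of this idea: two variants are conflated only if the corpus gives enough co-occurrence evidence. Document-level overlap is used below as a simplified stand-in for the window-based statistics of real corpus-based stemmers, and the example documents and threshold are made up.

```python
def docs_containing(word, documents):
    """Indices of the documents in which a word occurs."""
    return {i for i, doc in enumerate(documents) if word in doc.lower().split()}

def corpus_supports_conflation(word_a, word_b, documents, threshold=0.2):
    """Conflate two variants only if they co-occur in enough documents."""
    docs_a = docs_containing(word_a, documents)
    docs_b = docs_containing(word_b, documents)
    union = docs_a | docs_b
    return bool(union) and len(docs_a & docs_b) / len(union) >= threshold

docs = [
    "the stock index fell while other indices rose",
    "police enforce the law",
    "the new policy was announced",
]
print(corpus_supports_conflation("index", "indices", docs))   # True: merge them
print(corpus_supports_conflation("policy", "police", docs))   # False: keep separate
```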
Advantages:
- More accurate stemming based on actual data, reducing inappropriate conflation.
- Ensures that the resulting stems are actual words.
Limitations:
- Requires a separate statistical model for each corpus.
- Processing time can be significant due to the dual application of stemming algorithms.
Context-Sensitive Stemmer
This method of stemming applies statistical modeling to handle queries more intelligently. Before submitting a query to a search engine, it predicts the useful morphological variants of words. This helps in improving precision and reducing errors caused by inappropriate stemming. The steps involved include:
- Candidate Generation: The Porter stemmer generates initial word stems.
- Distributional Word Similarity: Statistical analysis calculates how likely two words are to co-occur in the same context, helping in predicting the most relevant variants.
- Query Segmentation and Headword Detection: Long queries are broken down into noun phrases, with the most important word (the headword) detected.
- Context-sensitive Word Expansion: Word variants are expanded based on n-gram language models, focusing on the most likely variants for the context.
- Context-sensitive Document Matching: After word variants are identified, document matching only happens if the variants appear in a similar context within the document.
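A highly simplified sketch of the query-side part of this pipeline: candidate variants of the headword are generated with the Porter stemmer, then only the variants that a language-model-style check judges plausible next to the rest of the query are kept. The bigram counts, vocabulary, and threshold below are illustrative placeholders, and nltk is assumed to be installed.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Toy bigram counts standing in for an n-gram language model built from a corpus.
BIGRAM_COUNTS = {
    ("develop", "software"): 12,
    ("developing", "software"): 9,
    ("developed", "software"): 2,
    ("development", "software"): 1,
}

def candidate_variants(word, vocabulary):
    """All vocabulary words sharing the Porter stem of the query word."""
    stem = stemmer.stem(word)
    return [w for w in vocabulary if stemmer.stem(w) == stem]

def context_sensitive_expand(headword, next_word, vocabulary, min_count=3):
    """Keep only the variants that plausibly occur next to the other query word."""
    return [
        variant
        for variant in candidate_variants(headword, vocabulary)
        if BIGRAM_COUNTS.get((variant, next_word), 0) >= min_count
    ]

vocab = ["develop", "developing", "developed", "development", "developer"]
print(context_sensitive_expand("developing", "software", vocab))
# Common variants like "develop" and "developing" survive; rare ones are dropped.
```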
Advantages:
- More selective word expansion on the query side.
- Context-sensitive document matching, ensuring that only relevant results are retrieved.
Limitations:
- High processing time due to the complexity of the analysis.
- Possible errors in detecting noun phrases and determining the context of word occurrences.
In summary, both the Xerox Inflectional and Derivational Analyzer and the Corpus-Based Stemmer are advanced linguistic tools designed to refine stemming accuracy based on the syntactic and semantic properties of words. The Context-Sensitive Stemmer builds on this further by applying intelligent, context-based modeling to improve web search precision.