Over the last decade, Natural Language Processing (NLP) technologies have entered mainstream usage. Machine Translation, Intelligent Virtual Assistants like Siri and Alexa, clever assistive smartphone keyboards, and hyper-intelligent web search engines like Google have had, and continue to have, a deep impact on modern society. All smart systems working with natural language share one fundamental aspect: they require word representations for natural language that are consistent, compact, contain deep linguistic knowledge, and support resolution of ambiguity inherent in human language. Therefore, research into word representations that encode linguistic and semantic knowledge is still essential.
In this dissertation, I present my work on several important open problems of word representations: understanding what information they contain, computing them faster, and using them intentionally in downstream tasks such as abbreviation disambiguation. I empirically show that Brown Clusters, a popular kind of word representation, are well suited for unsupervised learning of morphosyntactic information from natural languages, even when presented with small unstructured text corpora. This finding is based on extensive experimentation with several Indo-European languages. My research indicates that Brown Clusters are prime candidates for representing words in downstream NLP tasks that require syntactic knowledge.
Next, based on insights gained through empirical studies of algorithms that compute Brown Clusters, I propose methods to speed up cluster computation. My research shows that the Hybrid Exchange-Brown algorithm can be up to 21 times faster than the original Brown clustering algorithm without requiring specialized hardware.
Using insights into the inner workings of word2vec (another highly popular word representation method), I propose UAD, a state-of-the-art method for resolving the meaning of ambiguous abbreviations based on their usage in context. UAD intentionally uses word representations to support disambiguation of large numbers of abbreviations in the same model, in a completely unsupervised manner. I show that UAD outperforms previous state-of-the-art abbreviation disambiguation methods and can easily be used in new domains and languages, as it is unsupervised and does not rely on hand-designed, language-specific features. Furthermore, UAD’s intentional use of word representations results in disambiguation models that support the identification of challenging abbreviation meanings and allow corrections without relying on manually labeled data or requiring model retraining.