jannatjahan2222
Joined: 06 Mar 2024  Posts: 1
Posted: Wed Mar 06, 2024 11:27  Post subject: Word embeddings: AI that navigates text
Most modern artificial intelligence (AI) techniques were developed to work with numbers, which presents a challenge when working with words and text. To overcome this limitation, a class of algorithms known as word embeddings has been created that converts words into numbers, making it much easier to apply modern AI techniques to the analysis of natural language.
Word embedding algorithms ingest a corpus of text and generate a numerical vector for each word in the corpus, creating a language model that can be used to guide a wide range of classification and information-retrieval processes, such as those carried out by search engines like Google or Bing. Language models consist of groupings of numerical vectors that represent syntactic (context) and semantic (meaning) similarities between words. If a bilingual training corpus is used, certain algorithms will also detect similarities between languages.
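The notion of "similarity between vectors" is usually measured with cosine similarity. The sketch below uses tiny made-up 3-dimensional vectors purely for illustration; real models use hundreds of dimensions, and the words and numbers here are assumptions, not output from any actual model.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors: close to 1.0 means very similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings", invented for this sketch.
embeddings = {
    "metrics":    [0.9, 0.1, 0.3],
    "indicators": [0.8, 0.2, 0.4],
    "banana":     [0.1, 0.9, 0.1],
}

print(cosine_similarity(embeddings["metrics"], embeddings["indicators"]))  # high, ~0.98
print(cosine_similarity(embeddings["metrics"], embeddings["banana"]))      # low, ~0.24
```

Related words end up pointing in similar directions, so their cosine similarity is high; unrelated words do not.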
As an example, a language model that we developed at the IDB identified that the word “metrics” was closely related to the term “key performance indicators” and to its equivalent in English. From there, all kinds of remarkable vector arithmetic can be done to explore and infer relationships between words in the corpus, but in this article we will focus on one specific example.
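One well-known form of this vector arithmetic is the analogy "a is to b as c is to ?", answered by adding the offset between two vectors to a third. The sketch below uses invented 2-dimensional vectors chosen so the analogy works out; it is an illustration of the idea, not real model output.

```python
# Toy illustration of vector-offset analogies ("man is to king as woman is to ?").
# All vectors are invented for the sketch; real embeddings are learned from text.
vectors = {
    "king":  [0.8, 0.6],
    "man":   [0.7, 0.1],
    "woman": [0.6, 0.2],
    "queen": [0.7, 0.7],
}

def analogy(a, b, c):
    """Return the word whose vector is closest to b - a + c (squared distance)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    def dist(w):
        return sum((x - y) ** 2 for x, y in zip(vectors[w], target))
    # Exclude the query words themselves, as embedding libraries typically do.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=dist)

print(analogy("man", "king", "woman"))  # queen
```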
As a knowledge manager, I have found word embeddings to be immensely powerful for understanding our universe of knowledge. This includes understanding the way and the language in which an institution describes its work, that is, its specific jargon. In this context, embeddings become a mirror that reflects the institutional lexicon, and this reflection can be used to improve how knowledge is managed within an institution. This approach is particularly useful for deciphering what a user expects to find when performing a search, and for returning results that take such jargon into account.
Word embeddings in practice

There are numerous ways to generate word embeddings; perhaps the best known is the open-source Word2vec algorithm which, as its name implies, converts words into vectors. Word2vec worked well for building the Findit search engine in most cases. However, the algorithm had a critical limitation for our purposes: it could not infer terms related to words that were not explicitly mentioned in our original training corpus and were therefore not part of the model.
Even though our training corpus was quite large, at over 2 billion words, we ran into situations where this limitation caused the model to fall short of our needs. For example, when a user searched for “electromobility,” a word that was not in the language model, no results were shown, not even for terms as broad as “mobility.”
To overcome this challenge, we experimented with another open-source algorithm: fastText. Its main difference is that it also generates vectors at the character level, not just at the word level, which means its mapping includes the substrings of the words it analyzes. As a result, when a fastText-trained model encounters a word that was not in its initial training, it looks for substrings of that word and checks whether they appear in the model. In general, it works as well as Word2vec, but in our context it proved to have two important advantages:
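The substring idea can be sketched with fastText's character n-grams: each word is decomposed into overlapping character sequences, with boundary markers added so prefixes and suffixes are distinguishable. The function below is a simplified stdlib-only sketch of that decomposition, not the actual fastText implementation.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word, with fastText-style boundary markers < and >."""
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams

# A word seen in training and a word that was not share many n-grams.
known = char_ngrams("millimeter")
unseen = char_ngrams("meter")
shared = known & unseen
print(sorted(shared))  # includes "met", "ter", "meter", ...
```

Because “meter” and “millimeter” share n-grams such as “met”, “ter”, and even the full substring “meter”, a model that stores vectors per n-gram can relate the two words even if one of them never appeared in the training corpus.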
1. It helped us get good results even when user queries contained simple spelling errors.
2. It was able to handle user queries containing words that were not part of the training corpus, or that are not yet in the language model, whenever there were sufficient character-level similarities. For example, fastText would be able to identify a relationship between the words “meter” and “millimeter,” even if the word “meter” were not in the model.

Implementing fastText helped us take our search application to the next level. We can't wait to show you what we will develop with this technology.
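The second advantage can be sketched as follows: when a query word is missing from the vocabulary, an approximate vector can still be composed by averaging the vectors of its known character n-grams. This is a simplified sketch of the fastText idea; the n-gram vectors below are invented placeholders, not values from a trained model.

```python
def char_ngrams(word, n=3):
    """Character trigrams with fastText-style boundary markers."""
    marked = f"<{word}>"
    return {marked[i:i + n] for i in range(len(marked) - n + 1)}

# Hypothetical trigram vectors, standing in for what a trained model would store.
ngram_vectors = {
    "<me": [0.5, 0.1], "met": [0.6, 0.2], "ete": [0.4, 0.3],
    "ter": [0.5, 0.4], "er>": [0.6, 0.1],
}

def oov_vector(word):
    """Average the vectors of the word's known n-grams to approximate its embedding."""
    hits = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not hits:
        return None  # no character-level overlap at all: the word stays out of reach
    return [sum(dims) / len(hits) for dims in zip(*hits)]

print(oov_vector("meter"))  # composed from its five known trigrams
print(oov_vector("xyzzy"))  # None: no known trigrams
```

A vector built this way can then be compared against the rest of the model, which is why queries with unseen but morphologically similar words can still return results.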