Within the framework of the National Laboratory of Artificial Intelligence (MILAB) project coordinated by SZTAKI, researchers at the University of Szeged have developed and made freely available the HuSpaCy Hungarian language analysis system, which is ready for industrial use with better resource requirements and integrability. The system combines the latest research results in artificial intelligence and language technology into an easy-to-use tool for analyzing Hungarian texts.
Artificial intelligence-based algorithms for analyzing Hungarian texts kept pace with the digital development of major world languages until around 2010, when we were left behind: new methods favored languages spoken by many. The last decade has seen a breakthrough in language technology, not only in research, but also in bringing academic results to a level of technological maturity that can be used in industry. Today, even small companies that lack AI expertise can solve text analysis problems.
The new HuSpaCy system can help in this area: it simplifies the grammatical and semantic interpretation of Hungarian texts.
"We have created a set of tools for pre-processing content and sentences in Hungarian. This is necessary because any application that wants to solve a problem related to a text cannot work with raw character sets alone. Algorithms that work on natural language texts rely on human-interpretable grammatical symbols, so HuSpaCy can serve as a suitable basis for chatbots or even email parsing systems," explains Richárd Farkas, researcher at the University of Szeged.
AI revolution in language technology
The last decade has seen a revolution in artificial intelligence research: within machine learning solutions, there has been a push for so-called deep learning, where artificial neural networks can learn how to interpret what they are supposed to interpret.
This is how most of the natural language processing systems in use today work, i.e., rules are not written by linguists, but by learning algorithms that can learn deeper relationships and predictions. Examples of such well-known deep learning methods are BERT or OpenAI's GPT-3 algorithm.
However, there is a problem with such systems: they basically behave like a black box. Their operation is barely observable, so even if they give good results, we still don't know how they came to that conclusion. Therefore, they are not well controlled and therefore often of limited use in industrial applications. Consider that such a system decides whether you can get credit. Even in today's English language target applications, machine learning-based solutions are often used only for pre-analysis of texts, and then the final decision is made based on rules written by a human expert. In this way, a decision becomes transparent (e.g. the result of a machine credit assessment can be easily interpreted) and, in case of doubt, the human expert can even change the behavior of the system.
The development of text analysis software in Hungarian did not start today. The Hungarian research community started building the necessary linguistic databases already in the 2000s. These databases were also used by the HuSpaCy developers as a teaching database.
The HuSpaCy system is a generational change: it combines the advantages of deep learning methods with the interpretability and controllability of linguistic analysis. The system is able to perform full language analysis of sentences (syllables, word types, etc.) and to identify nouns (e.g. personal names, places) in running text. HuSpaCy builds on today's AI tools: it includes neural language models that the user can use to analyze the similarity of texts, but the grammar analysis steps mentioned above are all based on modern algorithms.
"HuSpaCy is part of the spaCy framework, which has become a quasi-international standard in recent years. Thus, all the languages that fit into the framework are effectively part of the digital language revolution," says George Orosz, head of the HuSpaCy project.
The newly created HuSpaCy system can also be the basis for voice or written chatbots (such as those being developed at the National Laboratory for Artificial Intelligence), but it can also be useful for text categorization (for example, for automatic sorting of complaints to customer service), information retrieval and automatic text generation.