Moksha corpora
Welcome to the start page of Moksha language corpora: the Main corpus of literary Moksha (contains mostly press) and the Corpus of Moksha-language social media.
This is the main page of the website where linguistic corpora of the Moksha language are located. Currently, two corpora are available: the corpus of contemporary written literary Moksha (“the Main corpus”) and the corpus of Moksha-language social media and forums. They differ in the kinds of texts they contain, but have mostly identical annotation and search capabilities. Here is a brief comparison:
| | Main corpus | Social media corpus |
|---|---|---|
| Language | Moksha | Moksha and Russian |
| Size | 1.74 million words | 14 thousand words (the Moksha part); 166 thousand words (the Russian part) |
| Texts | contemporary press (up to November 2018): 86.4%; translation of the New Testament: 8.9%; 20th century fiction: 0.8%; blogs: 0.7% | open posts and comments by Moksha-speaking vkontakte users (up to December 2018) |
| Language variety | in most cases, standard written literary Moksha or close to it | language of digital communication: closer to the spoken variety, influenced by the dialects and Russian language, contains numerous code switching instances |
| Annotation | | |
| Metadata | | |
Apart from the corpora available here, there exists another publicly available Moksha corpus developed by Jack Rueter. It contains 800 thousand tokens of fiction, but has no morphological annotation.
You can find more detailed information about the Moksha social media corpus and its development in this paper. Please consider citing it if your research is based on this corpus:
Timofey Arkhangelskiy. 2019. Corpora of social media in minority Uralic languages. Proceedings of the fifth Workshop on Computational Linguistics for Uralic Languages, pages 125–140, Tartu, Estonia, January 7 - January 8, 2019.
A language corpus is a collection of texts in that language which has been enriched with additional linguistic information, called annotation, and, preferably, equipped with a search engine. Here you will find a short list of frequently asked questions about the Moksha corpora.
— Who needs corpora?
First of all, corpora are used by linguists. The search engine and annotation of the corpora are designed in such a way that you can make linguistic queries such as “find all nouns in the genitive case” or “find all forms of the word тядя followed by a verb”. Apart from linguists, a corpus can be a useful tool for language teachers, language learners, and even native speakers.
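To make the two example queries more concrete, here is a minimal Python sketch of how they could be evaluated over annotated tokens. It is an illustration only, not the corpus search engine: the token structure, the tag names (`N`, `V`, `gen`, `nom`), the placeholder word forms and lemmas, and both helper functions are assumptions made for this example.

```python
# Toy annotated sentence: each token carries a lemma and a set of grammatical tags.
# Word forms and most lemmas are placeholders; tag names are illustrative only.
SENTENCE = [
    {"form": "form1", "lemma": "тядя", "tags": {"N", "gen"}},
    {"form": "form2", "lemma": "lemma2", "tags": {"V"}},
    {"form": "form3", "lemma": "lemma3", "tags": {"N", "nom"}},
]


def find_by_tags(tokens, required):
    """Return the tokens whose tag set contains every tag in `required`."""
    return [t for t in tokens if required <= t["tags"]]


def find_lemma_followed_by(tokens, lemma, next_tag):
    """Return (token, next_token) pairs where a form of `lemma`
    is immediately followed by a token tagged `next_tag`."""
    return [
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if a["lemma"] == lemma and next_tag in b["tags"]
    ]


# "Find all nouns in the genitive case"
genitive_nouns = find_by_tags(SENTENCE, {"N", "gen"})

# "Find all forms of the word тядя followed by a verb"
pairs = find_lemma_followed_by(SENTENCE, "тядя", "V")
```

The point of the sketch is that both queries look at the annotation (lemma and tags) rather than at the surface form, which is why they find every inflected form of a word at once.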
— Can I use the corpus as a library?
No, these corpora are not designed for that. When you work with a corpus, you make a query, i.e. search for a particular word, phrase or construction, and get back all sentences that contain what you searched for. By default, the sentences are shown in random order. You can expand the context of each sentence you get, i.e. look at its neighboring sentences. However, you may do so only a limited number of times for each sentence. Therefore, it is impossible to read an entire text in the corpus. This is done for copyright protection.
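The behavior described above (hits in random order, context that can be expanded only a limited number of times) can be pictured with a small Python sketch. This is not how the corpus platform implements it: the class `ResultSentence`, the function `search`, the limit of two expansions and the one-sentence step are all hypothetical choices made for illustration.

```python
import random


class ResultSentence:
    """One search hit plus a capped ability to reveal neighboring sentences."""

    def __init__(self, text_sentences, index, max_expansions=2):
        self._sentences = text_sentences
        self._lo = self._hi = index        # currently visible window
        self._remaining = max_expansions   # hypothetical limit

    def expand(self):
        """Widen the window by one sentence on each side; False once the limit is used up."""
        if self._remaining == 0:
            return False
        self._remaining -= 1
        self._lo = max(0, self._lo - 1)
        self._hi = min(len(self._sentences) - 1, self._hi + 1)
        return True

    def visible(self):
        """Return the sentences the user is currently allowed to see."""
        return self._sentences[self._lo:self._hi + 1]


def search(text_sentences, word, rng=random):
    """Return hits for `word` (whitespace-tokenized match) in random order."""
    hits = [
        ResultSentence(text_sentences, i)
        for i, sentence in enumerate(text_sentences)
        if word in sentence.split()
    ]
    rng.shuffle(hits)
    return hits
```

With a cap like this, the visible window around any hit stays much smaller than the whole text, which is the property the copyright restriction relies on.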
— Can I use the corpus as a dictionary?
Each Moksha word in the corpus has a Russian translation (no English translations are available at the moment). However, these translations are only provided as auxiliary information for users who do not speak Moksha. They are kept short and simple by design: they do not list all senses and do not provide usage examples the way real dictionaries do. If you want to know how to translate a word, the right way to do so is to consult a dictionary.
— What is morphological annotation and how do you get it?
The corpora located here are lemmatized and morphologically annotated. Lemmatization means that each word in the texts is annotated with its lemma, i.e. dictionary/citation form. Morphological annotation means that each word is annotated for its grammatical features, such as part of speech, number, case, tense, etc. Since the corpora in question are too large for manual annotation to be feasible, they were annotated automatically with a program called a morphological analyzer. The analyzer uses a manually compiled grammatical dictionary and a formalized description of Moksha inflection. The analyzer together with the necessary materials is freely available in my bitbucket repository. Automatic annotation unfortunately means that, first, out-of-vocabulary words are not annotated, and, second, some words receive several competing analyses. For example, confronted with the form валда, the analyzer cannot determine whether it should be analyzed as the citation form of валда (“bright”) or the ablative of the word вал (“about a word”). Russian sentences in the social media corpus were annotated with the mystem analyzer.
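The two limitations of automatic annotation can be illustrated with a toy lookup in Python. This is a sketch only and has nothing to do with the real analyzer's code or output format: the `Analysis` class, the tag labels and the one-entry lexicon are assumptions; only the form валда and its two readings come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    lemma: str    # dictionary/citation form
    gloss: str    # short translation of this reading
    tags: tuple   # grammatical features (labels are illustrative)


# Toy lexicon keyed by word form; a real analyzer derives analyses from a
# grammatical dictionary and inflection rules instead of listing forms.
TOY_ANALYSES = {
    "валда": [
        Analysis("валда", "bright", ("citation form",)),
        Analysis("вал", "about a word", ("ablative",)),
    ],
}


def analyze(form):
    """Return every analysis known for `form`; an empty list means out-of-vocabulary."""
    return TOY_ANALYSES.get(form.lower(), [])


def is_ambiguous(form):
    """A form is ambiguous when more than one analysis is available for it."""
    return len(analyze(form)) > 1
```

Here `analyze("валда")` returns both readings, so a search for either lemma will match this form, while a form missing from the lexicon gets no annotation at all.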
Moksha is one of the two Mordvinic languages, which belong to the Uralic family. The number of speakers is unknown because in the censuses most Erzya and Moksha speakers indicate “Mordvin” as their language; it can be very roughly estimated at 200,000. Moksha uses a Cyrillic orthography based on the Russian alphabet. All morphological markers are suffixes that mostly attach to the stem agglutinatively. Nominal grammatical categories are number, case, definiteness and possessiveness. Transitive verbs can index the person and number of both the subject and the direct object. The direct object can be marked either in the nominative or in the genitive (differential object marking, DOM). The word order in the sentence is free, with SVO (subject – verb – object) being the default.
If you have questions, would like to propose collaboration, or noticed an error in the corpus (except typos in blogs and social media: these texts are left "as is"), please contact Timofey Arkhangelskiy. You can also use the Moksha morphological analyzer and the tsakorpus corpus platform, which are open source and freely available.