Morphosyntactic or other linguistic annotation, such as lemmatization, is included for many languages
(for details of the current state, see this list).
Annotation is done partly locally and partly in cooperation with institutions in several Slavic countries (see below).
Alignment and most of the preprocessing are
fully automatic and language-independent.
The corpus has been updated. It now includes 27 million tokens in 31 languages. Most of the texts are tagged and lemmatized.
Some Dutch, Portuguese and Romanian texts have been added, and the corpus has been realigned. Serbian, Croatian and Macedonian as well as Romanian texts are now POS-tagged and lemmatized. We thank the colleagues who have made this possible (see below)!
Some changes in access policy.
The new XSLT-based interface is online, and many new texts are now available in all major Slavic literary languages. The corpus now includes over 25 million tokens in 32 languages; 5 texts are available in more than 12 languages.
Many new texts have been added. Several texts are now available in all major Slavic literary languages, partly made possible by cooperation with Emmerich Kelih from Graz University (see the section on cooperation). In addition, the corpus now includes translations into French, Italian, Lithuanian, Latvian, Modern Greek and many more non-Slavic languages. A new, experimental web interface is now online and under further development. Export functions have been added, and more improvements are coming.
Emmerich Kelih from the Department for Slavic Studies at Graz University contributed the Ostrovskij Subcorpus (Kelih 2009a, 2009b), which contains Nikolaj Ostrovskij's classic socialist realist novel Kak zakaljalas' stal' in 11 major Slavic languages, as well as a major part of the Bulgakov Subcorpus, which consists of Mikhail Bulgakov's Master i Margarita in all existing Slavic translations.
The Macedonian and Romanian parts of ParaSol were tagged and lemmatized by Tanja Samardžić and Andrea Gesmundo from the Computational Learning and Computational Linguistics (CLCL) research group, University of Geneva, using the BTTagger (see Gesmundo & Samardžić 2012). Ruprecht von Waldenfels helped prepare the Macedonian training data.
Aspectual information was added independently (see below).
Aspectual information for Croatian, Serbian and Macedonian was not part of the original tagging. It was derived by Ruprecht von Waldenfels using heuristics and lexicographic sources, including
CROVALLEX (Mikelić Preradović 2008; thanks for providing the data!), the Hrvatski Jezični Portal, and others.
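As a rough illustration of how lexicon lookup can be combined with a fallback heuristic to assign aspect to verb lemmas, consider the minimal sketch below. It is not the procedure actually used for ParaSol: the lexicon entries, the suffix rule, and the names ASPECT_LEXICON, IMPERFECTIVE_SUFFIXES and guess_aspect are all hypothetical and chosen only for illustration.

```python
from typing import Optional

# Hypothetical mini-lexicon mapping verb lemmas to aspect
# ("ipf" = imperfective, "pf" = perfective). In practice such data
# would come from a lexicographic resource; these entries are
# illustrative assumptions only.
ASPECT_LEXICON = {
    "pisati": "ipf",
    "napisati": "pf",
    "čitati": "ipf",
    "pročitati": "pf",
}

# Hypothetical fallback heuristic: lemmas with these endings are
# treated as imperfective (an assumption for illustration).
IMPERFECTIVE_SUFFIXES = ("ivati", "avati")


def guess_aspect(lemma: str) -> Optional[str]:
    """Return "ipf", "pf", or None if no decision can be made."""
    lemma = lemma.lower()
    # 1. The lexicon takes priority over any heuristic.
    if lemma in ASPECT_LEXICON:
        return ASPECT_LEXICON[lemma]
    # 2. Fall back to the suffix heuristic.
    if lemma.endswith(IMPERFECTIVE_SUFFIXES):
        return "ipf"
    # 3. Leave the lemma unannotated rather than guess.
    return None
```

Checking the lexicon first keeps attested information from being overridden by the heuristic, and returning None for unknown lemmas leaves them unannotated instead of forcing a guess.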
Agić, Ž., Tadić, M. & Dovedan, Z. 2008. Improving Part-of-Speech Tagging Accuracy for Croatian by Morphological Analysis. Informatica 32(4), 445-451.
Agić, Ž., Tadić, M. & Dovedan, Z. 2009. Evaluating Full Lemmatization of Croatian Texts. In: Recent Advances in Intelligent Information Systems. Warsaw: Academic Publishing House EXIT, 175-184.
Garabík, R. 2005. Levenshtein Edit Operations as a Base for a Morphology Analyzer. In: Garabík, R. (ed.): Computer Treatment of Slavic and East European Languages. Proceedings of Slovko 2005. Bratislava, 50-58.
Gesmundo, A. & Samardžić, T. 2012. Lemmatisation as a Tagging Task. In: Proceedings of the 50th Annual Meeting of the ACL, Volume 2, 368-372.
Hajič, J. 2004. Disambiguation of Rich Inflection (Computational Morphology of Czech). Charles University Press. Prague.
Piasecki, M. & Godlewski, G. 2006. Reductionistic, Tree and Rule Based Tagger for Polish. In: Kłopotek, M. et al. (eds.): Intelligent Information Processing and Web Mining. Proceedings of the International IIS: IIPWM'06 Conference held in Ustron, Poland, June 19-22, 2006. Berlin.
Mikelić Preradović, N. 2008. Pristupi izradi strojnog tezaurusa za hrvatski jezik (Approaches to the Development of the Machine Lexicon for Croatian Language). PhD thesis, Faculty of Humanities and Social Sciences, University of Zagreb.
Spoustová, D.J. In prep. Kombinované statisticko-pravidlové metody značkování češtiny (Combining Statistical and Rule-Based Approaches to Morphological Tagging of Czech Texts). PhD Thesis, Prague.
Spoustová, D., Hajič, J., Votrubec, J., Krbec, P. & Květoň, P. 2007. The Best of Two Worlds: Cooperation of Statistical and Rule-Based Taggers for Czech. In: Proceedings of the Workshop on Balto-Slavonic Natural Language Processing. Prague, 67-74.
v. Waldenfels, R. 2006. Compiling a parallel corpus of Slavic languages. Text strategies, tools and the question of lemmatization in alignment. In: Brehmer, B., Zdanova, V. & Zimny, R. (eds.): Beiträge der Europäischen Slavistischen Linguistik (POLYSLAV) 9. München, 123-138.
v. Waldenfels, R. 2011. Recent developments in ParaSol: Breadth for depth and XSLT based web concordancing with CWB. In: Daniela M. & Garabík, R. (eds.): Natural Language Processing, Multilinguality. Proceedings of Slovko 2011, Modra, Slovakia, 20-21 October 2011. Bratislava, 156-162.
v. Waldenfels, R. 2012. Aspect in the imperative across Slavic - a corpus driven pilot study. In: Grønn, A. & Pazelskaya, A. (eds.): The Russian Verb. Oslo Studies in Language 4, 141-154.