ParaSol - A Parallel Corpus of Slavic and Other Languages

Introduction

ParaSol is a parallel aligned corpus of translated and original belletristic texts in Slavic and some other languages. It is being developed by Ruprecht von Waldenfels, currently at the Slavic Institute of Zurich University.
Initially called the Regensburg Parallel Corpus (RPC), it was developed from 2006 to 2013 in cooperation with Roland Meyer, now at Humboldt University, at the Institute of Slavic Languages and Literatures, University of Bern, and the Institute of Slavistics, Regensburg University; since 2014, Ruprecht von Waldenfels has been the sole developer. For more information, see Waldenfels (2006, 2011, 2012) in the references below.
Access to ParaSol is provided by a web interface. For more information, please contact the developer.

Watch a short demonstration of the corpus interface here (12 MB). More information on the query language can be found on the web pages of the open CWB project; see also these short introductions in German and in English.

The corpus interface for ParaSol was developed by Roland Meyer, Ruprecht von Waldenfels and Andi Zeman and can easily be adapted for other parallel corpora. It is called ParaVoz and is available as open source.

News

March 2014
The corpus has been updated. It now includes 27 million tokens in 31 languages. Most of the texts are tagged and lemmatized.
December 2012
Some Dutch, Portuguese and Romanian texts added and the corpus realigned. Serbian, Croatian and Macedonian as well as Romanian texts are now POS-tagged and lemmatized. We thank the colleagues who have made this possible (see below)!
May 2012
Some changes in access policy.
July 2011
The new XSLT-based interface is online, and many new texts are now available in all major Slavic literary languages. The corpus now includes over 25 million tokens in 32 languages; 5 texts are available in more than 12 languages.
November 2010
Many new texts have been added. Several texts are now available in all major Slavic literary languages, partly made possible by a cooperation with Emmerich Kelih from Graz University (see the section on cooperations). In addition, the corpus now includes translations into French, Italian, Lithuanian, Latvian, Modern Greek and many more non-Slavic languages. A new, experimental web interface is now online and under further development. Export functions have been added, and more improvements are coming.

Cooperations

The Czech part of ParaSol was tagged by Alexandr Rosen and Drahomíra Spoustová from the Czech National Corpus.
For a description of the tagger, see Spoustová et al. (2007).
Quick reference to the tag set, more detailed information.


The Slovak part of ParaSol was tagged by Radovan Garabík from the Slovak National Corpus.
For a description of the tagger, see Garabík (2005). For the tag set, see the Slovak National Corpus home page.


The Polish part of ParaSol was first tagged by Adam Przepiórkowski from the IPI PAN Corpus of Polish.
For a description of the tagger, see Piasecki & Godlewski (2006). For the tag set, visit IPI PAN.


The Bulgarian texts were tagged by Svetla Koeva from the Bulgarian Academy of Sciences using the resources of the Bulgarian National Corpus.


The Ukrainian and Belarusian texts have been partly tagged and lemmatized thanks to Dmitri Sitchinava from the Russian National Corpus.


The Armenian texts were contributed by the Eastern Armenian National Corpus (special thanks to Misha Daniel!).


Emmerich Kelih from the Department for Slavic Studies at Graz University contributed the Ostrovskij Subcorpus (Kelih 2009a, 2009b), which contains Nikolaj Ostrovskij's classic socialist realist novel Kak zakaljalas' stal' in 11 major Slavic languages, as well as a major part of the Bulgakov Subcorpus, which consists of Mikhail Bulgakov's Master i Margarita in all existing Slavic translations.


The Serbian part of ParaSol was tagged by Miloš Utvić from the Human Language Technology Group at the University of Belgrade and the Corpus of Contemporary Serbian (SrpKor). For a description of the tagging and tagset, see Utvić (2011). Aspectual information was added independently (see below).


The Croatian part of ParaSol was tagged by Željko Agić from the Faculty of Humanities and Social Sciences, Zagreb University. For a description of the tagging and tagset, see Agić, Tadić & Dovedan (2008, 2009). Aspectual information was added independently (see below).


The Macedonian and Romanian parts of ParaSol were tagged and lemmatized by Tanja Samardžić and Andrea Gesmundo from the Computational Learning and Computational Linguistics (CLCL) research group, University of Geneva, using the BTTagger (see Gesmundo & Samardžić 2012). Ruprecht von Waldenfels helped prepare the Macedonian training data. Aspectual information was added independently (see below).
Aspectual information for Croatian, Serbian and Macedonian was not part of the original tagging and was derived by Ruprecht von Waldenfels using heuristics and lexicographic sources including CROVALLEX (Mikelić Preradović 2008; thanks for providing the data!), the Hrvatski Jezični Portal, and others.

References

Agić Ž., Tadić M. & Dovedan Z. 2008. Improving Part-of-Speech Tagging Accuracy for Croatian by Morphological Analysis. Informatica, 32(4), 2008, pp. 445-451.
Agić Ž., Tadić M. & Dovedan Z. 2009. Evaluating Full Lemmatization of Croatian Texts. Recent Advances in Intelligent Information Systems, Warsaw, Academic Publishing House EXIT, 2009, pp. 175-184.
Garabík, R. 2005. Levenshtein Edit Operations as a Base for a Morphology Analyzer. In: Garabík, R. (ed.): Computer Treatment of Slavic and East European Languages. Proceedings of Slovko 2005. Bratislava, 50 - 58.
Gesmundo, A. & Samardžić, T. 2012. Lemmatisation as a Tagging Task. In: Proceedings of the 50th Annual Meeting of the ACL, Volume 2, pp. 368-372.
Hajič, J. 2004. Disambiguation of Rich Inflection (Computational Morphology of Czech). Charles University Press. Prague.
Piasecki M., Godlewski, G. 2006. Reductionistic, Tree and Rule Based Tagger for Polish. In: Klopotek M. et al. (eds.): Intelligent Information Processing and Web Mining. Proceedings of the International IIS: IIPWM'06 Conference held in Ustron, Poland, June 19-22, 2006. Berlin.
Mikelić Preradović, N. 2008. Pristupi izradi strojnog tezaurusa za hrvatski jezik (Approaches to the Development of the Machine Lexicon for Croatian Language), PhD thesis, Faculty of Humanities and Social Sciences, University of Zagreb.
Spoustová, D.J. In prep. Kombinované statisticko-pravidlové metody značkování češtiny (Combining Statistical and Rule-Based Approaches to Morphological Tagging of Czech Texts). PhD Thesis, Prague.
Spoustová, D., Hajič, J., Votrubec, J., Krbec, P., Květoň, P. 2007. The Best of Two Worlds: Cooperation of Statistical and Rule-Based Taggers for Czech. In: Proceedings of the Workshop on Balto-Slavonic Natural Language. Prague, 2007, 67-74.
v. Waldenfels, R. 2006. Compiling a parallel corpus of Slavic languages. Text strategies, tools and the question of lemmatization in alignment. In: Brehmer, B., Zdanova, V., Zimny, R. (eds.): Beiträge der Europäischen Slavistischen Linguistik (POLYSLAV) 9. München, 123-138.
v. Waldenfels, R. 2011. Recent developments in ParaSol: Breadth for depth and XSLT based web concordancing with CWB. In: Daniela M., and Garabík, R. (eds.), Natural Language Processing, Multilinguality. Proceedings of Slovko 2011, Modra, Slovakia, 20–21 October 2011. Bratislava, 156-162.
v. Waldenfels, R. 2012. Aspect in the imperative across Slavic - a corpus driven pilot study. In: A. Grønn and A. Pazelskaya (eds.): The Russian Verb. Oslo Studies in Language 4. 141-154.
