ParaSol: A Parallel Corpus of Slavic and other languages


ParaSol is a parallel aligned corpus of translated and original belletristic texts in Slavic and some other languages. It is being developed by Ruprecht von Waldenfels, currently at the Slavic Institute of Zurich University.
Initially called the Regensburg Parallel Corpus (RPC), it was developed from 2006 to 2013 in cooperation with Roland Meyer, now at Humboldt University, at the Institute of Slavic Languages and Literatures, University of Bern, and the Institute of Slavistics, Regensburg University; since 2014 Ruprecht von Waldenfels has been the sole developer. For more details, see Waldenfels (2006, 2011, 2012) in the references below.
Access to ParaSol is provided by a web interface. For more information, please contact the developer.

Watch a short demonstration of the corpus interface here (12 MB). More information on the query language can be found on the web pages of the open CWB project; see also these short introductions in German and in English.

The corpus interface for ParaSol was developed by Roland Meyer, Ruprecht von Waldenfels and Andi Zeman and can be easily adapted for other parallel corpora. It is called ParaVoz and is available as open source.


March 2014
The corpus has been updated. It now includes 27 million tokens in 31 languages. Most of the texts are tagged and lemmatized.
December 2012
Some Dutch, Portuguese and Romanian texts added and the corpus realigned. Serbian, Croatian and Macedonian as well as Romanian texts are now POS-tagged and lemmatized. We thank the colleagues who have made this possible (see below)!
May 2012
Some changes in access policy.
July 2011
The new XSLT-based interface is online, and many new texts are now available in all major Slavic literary languages. The corpus now includes over 25 million tokens in 32 languages; 5 texts are available in more than 12 languages.
November 2010
Many new texts have been added. Several texts are now available in all major Slavic literary languages, partly made possible by a cooperation with Emmerich Kelih from Graz University (see the section on cooperations). In addition, the corpus now includes translations into French, Italian, Lithuanian, Latvian, Modern Greek and many more non-Slavic languages. A new, experimental web interface is now online and under further development. Export functions have been added, and more improvements are coming.


The Czech part of ParaSol was tagged by Alexandr Rosen and Drahomíra Spoustová from the Czech National Corpus.
For a description of the tagger, see Spoustová et al. (2007).
Quick reference to the tag set, more detailed information.

The Slovak part of ParaSol was tagged by Radovan Garabík from the Slovak National Corpus.
For a description of the tagger, see Garabík (2005). For the tag set, see the Slovak National Corpus home page.

The Polish part of ParaSol was first tagged by Adam Przepiórkowski from the IPI PAN Corpus of Polish.
For a description of the tagger, see Piasecki & Godlewski (2006). For the tag set, visit IPI PAN.

The Bulgarian texts were tagged by Svetla Koeva from the Bulgarian Academy of Sciences using the resources of the Bulgarian National Corpus.

The Ukrainian and Belarusian texts have been partly tagged and lemmatized thanks to Dmitri Sitchinava from the Russian National Corpus.

The Armenian texts were contributed by the Eastern Armenian National Corpus (special thanks to Misha Daniel!).

Emmerich Kelih from the Department for Slavic Studies at Graz University contributed the Ostrovskij Subcorpus (Kelih 2009a, 2009b), which consists of Nikolaj Ostrovskij's classic socialist realist novel Kak zakaljalas' stal' in 11 major Slavic languages, as well as a major part of the Bulgakov Subcorpus, which consists of Mikhail Bulgakov's Master i Margarita in all existing Slavic translations.

The Serbian part of ParaSol was tagged by Miloš Utvić from the Human Language Technology Group at the University of Belgrade and the Corpus of Contemporary Serbian (SrpKor). For a description of the tagging and tagset, see Utvić (2011). Aspectual information was added independently (see below).

The Croatian part of ParaSol was tagged by Željko Agić from the Faculty of Humanities and Social Sciences, Zagreb University. For a description of the tagging and tagset, see Agić, Tadić & Dovedan (2008, 2009). Aspectual information was added independently (see below).

The Macedonian and Romanian parts of ParaSol were tagged and lemmatized by Tanja Samardžić and Andrea Gesmundo from the Computational Learning and Computational Linguistics (CLCL) research group, University of Geneva, using the BTTagger (see Gesmundo & Samardžić 2012). Ruprecht von Waldenfels helped prepare the Macedonian training data. Aspectual information was added independently (see below).
Aspectual information for Croatian, Serbian and Macedonian was not part of the original tagging; it was derived by Ruprecht von Waldenfels using heuristics and lexicographic sources, including CROVALLEX (Mikelić Preradović 2008; thanks for providing the data!), the Hrvatski Jezični Portal, and others.


