3 Query Language

(Author: Adam Przepiórkowski, Jakub Wilk; modified: 2 October 2011)

Poliqarp’s query syntax is based on that of Corpus Query Processor (CQP), perhaps the most popular program of this kind, created at the University of Stuttgart, but it contains a number of additional features and improvements. The present section describes the syntax of Poliqarp queries and illustrates it with numerous examples.

3.1 Searching for orthographic forms

In the simplest case, a query is just a sequence of segments, e.g.:

There are three segments in the latter query above, corresponding to two words: przyszedłem and rano. In the case of simple queries like the two queries above, Poliqarp attempts to identify those words which might consist of smaller segments and to handle them properly, so also the following queries will give the expected results:

In case of the latter query, Poliqarp will find all occurrences of the three-segment sequence [długo][m] [szedł], interpretable as an adverb (długo ‘long’), an agglutinate (-m ‘be’), and an l-participle (szedł ‘walk, go’), as well as all occurrences of the two-segment sequence [długom] [szedł], where the first segment is interpreted as a dative nominal form (długom ‘debts’), and the second – again, as an l-participle.

By default, queries are interpreted in a case-sensitive manner, so the following queries will produce different results:

In order to find all occurrences of the form przyszedł, regardless of case, the flag /i should be used. Thus, the two queries below will produce the same results, which will in particular contain all results of both queries above.

Both in the graphical version and in the text version of Poliqarp, case sensitivity can be set globally, for a whole query or a series of queries.

Queries may contain standard regular expressions over characters, specified with the help of the following special characters: ?, *, +, ., ,, |, {, }, [, ], (, ), as well as natural numbers; segment specifications containing regular expressions must be enclosed in quotes ". Since the formal introduction of regular expressions lies far outside the scope of the current publication, we will be content with discussing just a few examples, which, nevertheless, should allow the user to understand the syntax and semantics of such regular expressions.

  1. "Ala|Ela"

    the character | introduces the alternative of two expressions, so the query above can be used to find all occurrences of segments of the form Ala or Ela,

  2. "[AE]la"

    square brackets denote the alternative of characters within them, so the query above can be used to find those segments whose first character is A or E, and the following two characters are la, i.e., this query is equivalent to the previous query,

  3. "beza?"

    the question mark signals the optionality of the character or the expression in parentheses which immediately precedes it, so the query above can be used to find all occurrences of the segments bez and beza,

  4. "bez."

    the period denotes any character, so the results of this query will include beza, bezy, bezą, etc., but not bez or bezami,

  5. "bez.?"

    bez, beza, bezy, bezą, etc., but not bezami,

  6. ".z.z."

    5-character segments, where 2nd and 4th characters are z (e.g., czczą and rzezi),

  7. ".z.z..?"

    segments of length 5 or 6, where 2nd and 4th characters are z (e.g., czczą, rzezi and szczyt),

  8. "a*by"

    the asterisk denotes any number of occurrences of the character or the expression in parentheses which immediately precedes it, so this query can be used to find segments beginning with any number of as, followed by by, e.g., by (zero occurrences of a), aby, aaaaby, etc.,

  9. "Ala.*"

    segments beginning with Ala, e.g., Ala and Alabama,

  10. "ala.*"/i

    segments beginning with ala, Ala, aLa, ALA, etc., e.g., Ala, alabaster and ALABAMA,

  11. ".*al+"

    the plus is interpreted similarly to the asterisk: it denotes one or more occurrences of the character or the expression in parentheses which immediately precedes it, so this query can be used to find segments ending in al, all, alll, etc., but not in a, e.g., dal, robal and Gall,

  12. "a{1,3}b.*"/i

    the expression of the form {n,m} denotes from n to m occurrences of the character or the expression in parentheses which immediately precedes it; in this case, the query above can be used to find segments beginning with 1 to 3 occurrences of a or A, followed by b or B, and then followed by any sequence of characters, e.g., Aby, aaaby, absolutnie, ABBA,

  13. ".*(la){3,}.*"

    {n,} means at least n occurrences, so this query will help to find segments which contain at least three occurrences of the sequence la in a row, e.g., tralalala, sialalala,

  14. "[bcćdfghjklłmnńprsśtwzźż]{4,}[aąeęioóuy]"/i

    segments consisting of at least 4 consonants followed by exactly 1 vowel, e.g., źdźbła and Chrzczę,

  15. "([bcćdfghjklłmnńprsśtwzźż]{3}[aąeęioóuy]){2,}"/i

    segments consisting of at least two sequences of the type CCCV, where C is a consonant, and V is a vowel, e.g., wszystko, Zdmuchnąwszy and Szmajdziński; {n} means exactly n occurrences,

  16. "([^aąeęioóuy]{3}[aąeęioóuy]){2,}"/i

    as above,

  17. "(pod|na|za)jecha.*"

    segments beginning with podjecha, najecha or zajecha, e.g., podjechał, zajechawszy.

The specifications of segments given above must match complete segments, rather than only their parts, hence the necessity of flanking the sequence (la){3,} in query 13. above with the regular expression .*, matching any sequence of characters (also the empty sequence). The same effect can be achieved with the help of the flag /x, which means that the given specification must be matched by a subsequence of the segment, not necessarily by the complete segment:
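The difference between matching a complete segment and matching a part of it can be tried out with any regular-expression engine. The sketch below uses Python's re module as a stand-in; segment_matches is a hypothetical helper, not part of Poliqarp, and Python's regular-expression dialect may differ from Poliqarp's in details. The flags argument imitates /i (ignore case) and /x (match a subsequence of the segment).

```python
import re


def segment_matches(pattern: str, segment: str, flags: str = "") -> bool:
    """Illustrative model of a quoted segment specification.

    'i' in flags -> case-insensitive matching (like /i)
    'x' in flags -> the pattern may match any part of the segment (like /x)
    Without 'x', the pattern has to match the complete segment.
    """
    re_flags = re.IGNORECASE if "i" in flags else 0
    matcher = re.search if "x" in flags else re.fullmatch
    return matcher(pattern, segment, re_flags) is not None


# Complete-segment matching: "bez." needs exactly one more character.
for word in ["bez", "beza", "bezami"]:
    print(word, segment_matches("bez.", word))

# Without .* on both sides, (la){3,} does not cover the whole segment ...
print(segment_matches("(la){3,}", "tralalala"))
# ... unless the pattern is flanked with .* or the 'x' flag is used.
print(segment_matches(".*(la){3,}.*", "tralalala"))
print(segment_matches("(la){3,}", "tralalala", "x"))
```

The same helper can be used to check the other patterns in the list above against the example words given there.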

3.2 Searching for base forms

The following query may be used in order to find all forms of the lexeme korpus:

The base attribute is one of many attributes that may be used in a query. The value of this attribute should specify the base form (the lemma), so a query like [base=pisać] can be used to find forms such as pisać ‘write’ (infinitive), piszę (non-past form), pisała (l-participle), piszcie (imperative), pisanie (gerund), pisano (impersonal), pisane (adjectival participle), etc.

Another attribute that may be used in queries is orth. The values of this attribute specify segments, so each of the following pairs contains queries which are equivalent.

On the other hand, the two queries below are not equivalent:

In the first case, Poliqarp will guess that the word przyszedłem may consist of two segments, przyszedł and em, and will expand the query accordingly, as described in §3.1. In contrast, the value of orth is always interpreted as the specification of a single segment.

The values of base and orth may contain regular expressions of the kind described in §3.1 above, e.g.:

3.3 Higher order queries

Queries about segments and about base forms may be combined. For example, the following query may be used to find all occurrences of the segment minę understood as a form of the lexeme mina ‘mine, face’ (and not, say, as a form of the lexeme minąć, ‘to pass’):

A similar effect can be achieved with the help of the following query, about those occurrences of the segment minę which are not interpreted as forms of minąć.

The condition that the base form be different from minąć may also be specified by putting the negation (the exclamation mark) before the name of the attribute, so the query below is equivalent to the query above.

Just as in the propositional calculus, double negation is equivalent to no negation, so the following queries about the segment nie understood as a form of the pronoun on are fully equivalent:

In Poliqarp queries, the operator & plays the role of logical conjunction. The operator dual to & is |, which plays the role of logical disjunction, e.g.:

In order to better understand the difference between the operators & and |, let us compare the effect of the following two queries:

The result of the former query will consist of those segments which simultaneously (conjunction) have the orthographic form minę and are interpreted as a form of the lexeme mina. On the other hand, the result of the latter query will consist of segments which either (disjunction) have the orthographic form minę, regardless of the interpretation of this segment, or are a form of the lexeme mina (e.g., mina, miny, minami). Hence, the latter query should return many more results than the former query.
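The contrast between conjunction and disjunction can be made concrete with a toy model. In the sketch below, each segment is a dictionary with one orth value and one base value; the data are invented for illustration, and the model ignores the fact that a real corpus segment may carry several interpretations.

```python
# Invented toy "corpus": one interpretation per segment.
segments = [
    {"orth": "minę", "base": "mina"},
    {"orth": "minę", "base": "minąć"},
    {"orth": "miny", "base": "mina"},
    {"orth": "minami", "base": "mina"},
    {"orth": "dom", "base": "dom"},
]

# Conjunction (&): both conditions must hold for the same segment.
conjunction = [s for s in segments
               if s["orth"] == "minę" and s["base"] == "mina"]

# Disjunction (|): it is enough for one of the conditions to hold.
disjunction = [s for s in segments
               if s["orth"] == "minę" or s["base"] == "mina"]

# Negation (!): the segment minę not interpreted as a form of minąć.
negation = [s for s in segments
            if s["orth"] == "minę" and not s["base"] == "minąć"]

print(len(conjunction), len(disjunction), len(negation))
```

In this toy data, every segment found by the conjunction is also found by the disjunction, but not the other way round.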

As the examples above show, specifications of corpus positions, enclosed in square brackets, may contain any number of conditions of the type attribute=value, combined with the operators !, & and |. It is also possible to completely omit any conditions – the query below could be used to find all segments in the corpus.

This trivial specification of corpus positions, matching any segment, may be useful for finding two forms in a certain distance from each other, e.g., two segments separated by two other segments, as in the following query:

The result of this query will include sequences such as się nikogo nie bać, się Boga nie boicie, etc.

It would perhaps be more interesting to specify the upper limit on the number of segments which may intervene between two forms, not just the exact number of such intervening positions. Poliqarp makes it possible to pose such queries, as it also allows regular expressions over corpus positions. For example, the following query may be used to find a form of the lexeme bać occurring two, three or four positions after the segment się:

The result of this query will contain all the sequences found by the previous query, as well as sequences such as się każdy następny Rywin będzie bał.
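The idea of a bounded number of intervening positions can be sketched as a simple search over a list of tokens. The function find_with_gap and the lemma annotations below are assumptions made for illustration only; the gap bounds are parameters, and the sketch does not reproduce how Poliqarp itself evaluates or reports matches.

```python
def find_with_gap(tokens, first_orth, second_base, min_gap, max_gap):
    """Return (i, j) pairs such that tokens[i] has the given orth,
    tokens[j] has the given base form, and the number of tokens
    strictly between them lies between min_gap and max_gap."""
    hits = []
    for i, token in enumerate(tokens):
        if token["orth"] != first_orth:
            continue
        for gap in range(min_gap, max_gap + 1):
            j = i + gap + 1
            if j < len(tokens) and tokens[j]["base"] == second_base:
                hits.append((i, j))
    return hits


def tok(orth, base):
    return {"orth": orth, "base": base}


# Hand-annotated toy sentences (the lemmas are supplied by hand here).
short = [tok("się", "się"), tok("nikogo", "nikt"),
         tok("nie", "nie"), tok("bać", "bać")]
long_ = [tok("się", "się"), tok("każdy", "każdy"),
         tok("następny", "następny"), tok("Rywin", "Rywin"),
         tok("będzie", "być"), tok("bał", "bać")]

# Exactly two intervening tokens: only the short sentence matches.
print(find_with_gap(short, "się", "bać", 2, 2))
print(find_with_gap(long_, "się", "bać", 2, 2))

# Between two and four intervening tokens: both sentences match.
print(find_with_gap(short, "się", "bać", 2, 4))
print(find_with_gap(long_, "się", "bać", 2, 4))
```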

A more accurate query concerning various occurrences of the inherently reflexive verb bać się should find się within a certain window before a form of the lexeme bać, but without any intervening punctuation (intervening punctuation will often indicate clause boundary), or immediately after a form of bać, separated from that form by at most a single personal pronoun:

3.4 Searching for tags

The rather baroque query above can be simplified by replacing the condition orth!="[.!?,:]" with a direct reference to the ‘grammatical class’ interp:

In general, the values of the pos attribute are the abbreviations of names of grammatical classes discussed in §2.2 (cf. the table in 2.2). For example, a query about a sequence of two nominal forms beginning with an a may be formulated as follows:

The specifications of the values of pos may, just as in case of orth and base, contain regular expressions. For example, taking into account the fact that personal pronouns are split between the class of 3rd person pronouns ppron3 and non-3rd person pronouns ppron12, the following queries may be used to find any form of any personal pronoun:

That means that the query about bać się may be further simplified:

Apart from the specifications of segments (with the help of orth), base forms (base) and grammatical classes (pos), queries may contain specifications of particular grammatical categories, such as case or gender. The following attributes may be used to this end (cf. §2.1):

attribute               possible values
number                  sg pl
case                    nom gen dat acc inst loc voc
gender                  m1 m2 m3 f n
person                  pri sec ter
degree                  pos comp sup
aspect                  imperf perf
negation                aff neg
accentability           akc nakc
post-prepositionality   npraep praep
accommodability         congr rec
agglutination           agl nagl
vocalicity              nwok wok
fullstoppedness         pun npun

Hence, it is possible to pose the following queries:

  1. [number=sg]

    find singular forms

  2. [pos=subst & number=sg]

    find singular nominal forms

  3. [pos=subst & gender!=f]

    find masculine and neuter nominal forms

  4. [number=sg & case="nom|acc" & gender="m[123]"]

    find singular masculine forms in the nominative or in the accusative case

The following three-letter abbreviations may be used instead of the full names of attributes:

attribute               abbreviation
number                  nmb
case                    cas
gender                  gnd
person                  per
degree                  deg
aspect                  asp
negation                neg
accommodability         acm
accentability           acn
post-prepositionality   ppr
agglutination           agg
vocalicity              vcl
fullstoppedness         fsn

For example, the query below is equivalent to 4. above:

In the graphical and text versions of Poliqarp, it is possible to define so-called aliases, i.e., abbreviations for alternative values of a given attribute, which may themselves be used as if they were possible values of attributes. The current version of the National Corpus of Polish has four such aliases already pre-defined:

alias   definition
masc    m1 m2 m3
noun    subst depr ger ppron12 ppron3
pron    ppron12 ppron3 siebie
verb    fin praet aglt bedzie inf imps impt pact ppas pcon pant ger winien
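One way to think about an alias is as shorthand for an alternative of values. The sketch below, with the hypothetical helper expand_alias, turns an alias from the table above into a regular-expression alternative of the kind used in attribute values; how Poliqarp handles aliases internally is not described here, so this is only an illustration.

```python
# Alias definitions copied from the table above.
ALIASES = {
    "masc": ["m1", "m2", "m3"],
    "noun": ["subst", "depr", "ger", "ppron12", "ppron3"],
    "pron": ["ppron12", "ppron3", "siebie"],
    "verb": ["fin", "praet", "aglt", "bedzie", "inf", "imps", "impt",
             "pact", "ppas", "pcon", "pant", "ger", "winien"],
}


def expand_alias(value: str) -> str:
    """Return an alternative such as 'm1|m2|m3' for an alias,
    or the value itself if it is not a known alias."""
    return "|".join(ALIASES.get(value, [value]))


print(expand_alias("masc"))
print(expand_alias("noun"))
print(expand_alias("f"))
```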

With the definitions of the aliases noun and masc given above, the following two queries are equivalent:

The values of grammatical classes and categories may be specified jointly, with the use of the tag attribute. For example, the following query may be used to find singular nominative neuter nouns:

The values of the tag attribute have the form kl:kat1:kat2:...:katn, where kl is the name of a grammatical class, while kat1, kat2, ..., katn are the values of the grammatical categories appropriate for that class, in the order specified in the table in 2.2.
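The colon-separated shape of a tag value can be illustrated by splitting it into its class and its category values. The helper parse_tag is hypothetical, and the example tag assumes that the categories of subst come in the order number, case, gender; the authoritative order is the one given in the table in 2.2.

```python
def parse_tag(tag: str):
    """Split a tag of the form kl:kat1:...:katn into the grammatical
    class and the list of category values that follow it."""
    grammatical_class, *categories = tag.split(":")
    return grammatical_class, categories


# Assumed example tag: singular nominative neuter noun.
grammatical_class, categories = parse_tag("subst:sg:nom:n")
print(grammatical_class)
print(categories)
```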

Just as in case of other attributes, also the specification of the value of tag may contain regular expressions, e.g.:

Specification of grammatical classes and grammatical categories may contain variables (having the form $n, where n is a single digit), whose values will be set only during execution of the query. For example, the following query for an adjective and a following noun agreeing in case:

can be simplified to:

3.5 Ambiguities

One of the features that distinguish the National Corpus of Polish and Poliqarp from other corpora and search tools is the representation and processing of ambiguities. There are cases where it is impossible to tell which of a number of interpretations is the right one, as in 1. below.

  1. Pamiętam ją pijaną.
    remember.1st her.acc drunk.acc/ins

    ‘I remember her drunk.’

Since it is impossible to resolve the grammatical case of pijaną in 1., both interpretations, accusative and instrumental, should be marked in the corpus as correct in this context.

However, given that after disambiguation a single segment may contain more than one interpretation, the question arises whether such ambiguous segments, e.g., pijaną in 1., should be included in the result of a query which matches only some of these interpretations, e.g., in the result of the query [case=acc]. On the one hand, the segment pijaną should be included in the result of [case=acc], as accusative is one of the correct interpretations of this segment in this context, but on the other hand, this segment should not be included, as it is not absolutely certain that this is an accusative form.

Instead of choosing between these interpretations of a query like [case=acc], Poliqarp allows the user to pose both kinds of queries. When a single equality sign is used, as in [case=acc], all segments at least one of whose interpretations matches the given condition will be returned, so both ją and pijaną in 1. will be included in the result of this query. On the other hand, when two equality signs are used, as in [case==acc], only those segments will be returned all of whose interpretations satisfy the condition expressed with ==, i.e., in 1., only the form ją will match the query.

With this distinction in hand, it is possible to search for forms which, e.g., may in a given context be interpreted as either accusative or genitive, so – given a properly tagged corpus – the following query should give non-empty results.

Conversely, the query below matches those segments whose all interpretations in the given context are at the same time accusative and genitive, so it will necessarily produce empty results.

The queries above pertain to interpretations which are the result of morphosyntactic disambiguation. The National Corpus of Polish also contains all other interpretations assigned to a given segment by the morphological analyser. In some situations it is useful to have access to such interpretations rejected by the disambiguator, e.g., for the task of finding all syncretic forms of a certain kind in the corpus, or when investigating disambiguation errors. For example, in order to find all syncretic accusative/genitive forms in the corpus, regardless of their interpretation in contexts in which they occur, the following query may be posed:

The final equality operator available in Poliqarp queries is ~~. The following query may be used for finding those forms which are unambiguously accusative, again, regardless of the context in which they occur.

The table below summarises the four equality operators put at the user’s disposal in Poliqarp.

                              in the results of        in the results of
                              morphological analysis   disambiguation
at least one interpretation   ~                        =
each interpretation           ~~                       ==
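The four operators can be modelled as "any" or "all" checks over two sets of interpretations. The sketch below is an illustration under stated assumptions rather than a description of Poliqarp's internals: each segment carries a non-empty set of case values from morphological analysis and a non-empty subset of it kept after disambiguation, and the annotations themselves are invented.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    orth: str
    analysed: frozenset       # case values from morphological analysis
    disambiguated: frozenset  # the subset kept after disambiguation


def matches(segment: Segment, operator: str, value: str) -> bool:
    """Evaluate a condition such as case=acc, case==acc, case~acc
    or case~~acc on a single segment."""
    if operator in ("~", "~~"):
        pool = segment.analysed
    elif operator in ("=", "=="):
        pool = segment.disambiguated
    else:
        raise ValueError(f"unknown operator: {operator}")
    if operator in ("=", "~"):
        return value in pool              # at least one interpretation
    return all(v == value for v in pool)  # each interpretation


# Invented annotations: pijaną stays ambiguous after disambiguation,
# dom is syncretic (nom/acc) but resolved to acc in its context.
pijana = Segment("pijaną", frozenset({"acc", "inst"}),
                 frozenset({"acc", "inst"}))
dom = Segment("dom", frozenset({"nom", "acc"}), frozenset({"acc"}))

for seg in (pijana, dom):
    row = {op: matches(seg, op, "acc") for op in ("~", "=", "~~", "==")}
    print(seg.orth, row)
```

Because the disambiguated interpretations are assumed to be a non-empty subset of the analysed ones, a match with == entails a match with =, and a match with = entails a match with ~ in this model.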

It should be clear that the following implications hold:

3.6 Constraining matches to sentences or paragraphs

Texts contained in the National Corpus of Polish are divided into sentences and paragraphs. This information may be taken into account in queries, in order to constrain a query to a sentence or a paragraph, as in the query below, which may be used to find the form się separated from a form of the verb bać by any positive number of (non-się) segments, but within a sentence.

Similarly, the qualifier ‘within p’ constrains the scope of a query to a paragraph.

3.7 Constraining matches with metadata

Each text in the National Corpus of Polish comes with a set of data about that text, such as its title and author, publisher, date of publication, etc. Some of these metadata are accessible through Poliqarp and may be used to constrain the scope of a query, e.g., to texts by a given author or published between certain dates.

The following meta-attributes are available in the National Corpus of Polish:

Usually only some of these attributes will have a value defined, e.g., when only the date of the publication is known, not the date of the first publication or the date of origin, or in case of short newspaper notes, which might lack information about the author or even the title.

In order to constrain the scope of a query with metadata, the keyword meta should be placed at the end of the query and it should be followed by specifications of values of meta-attributes. In case the scope of the query is also constrained to a sentence or to a paragraph, the specification of metadata should follow the structural constraint, e.g.:

Just as in case of ordinary attributes such as orth or pos, also the specifications of values of the meta-attributes author and title may contain regular expressions. For example, the query below may be used to find forms of the lexeme wirus in those texts whose title contains one of the sequences: windows or microsoft.

By default, the specifications of values of author and title are taken to be case-insensitive and they are interpreted as matching (at least) parts of values of appropriate meta-attributes, so the following query will find sequences of nominal forms in works by, inter alia, Pol, Polkowski and Rampolski:

To change that default behaviour, the flags /X and /I may be used. The effect of these flags is dual to the effect of the flags /x and /i described above: the effect of /X is that a given specification of the value of an attribute is understood as matching the complete value of that attribute, while the flag /I enforces the case-sensitive interpretation, as in examples below:
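The defaults for author and title, and the way /X and /I reverse them, can be sketched with Python's re module. The helper meta_matches is hypothetical, the pattern pol is chosen here only for illustration, and passing the flags as a string of letters is an assumption of this sketch rather than a statement about Poliqarp's syntax.

```python
import re


def meta_matches(pattern: str, value: str, flags: str = "") -> bool:
    """Illustrative model of matching an author or title value.

    Default: case-insensitive, the pattern may match part of the value.
    'X' in flags -> the pattern must match the complete value (like /X)
    'I' in flags -> matching is case-sensitive (like /I)
    """
    re_flags = 0 if "I" in flags else re.IGNORECASE
    matcher = re.fullmatch if "X" in flags else re.search
    return matcher(pattern, value, re_flags) is not None


authors = ["Pol", "Polkowski", "Rampolski"]

# Default behaviour: all three authors are matched by "pol".
print([a for a in authors if meta_matches("pol", a)])

# Complete-value matching: only the author whose whole name matches.
print([a for a in authors if meta_matches("pol", a, "X")])

# Case-sensitive matching: lower-case "pol" must occur literally.
print([a for a in authors if meta_matches("pol", a, "I")])
```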

Regular expressions are not allowed in case of the date-valued attributes created, acquired, recorded, first_published and published. On the other hand, it is possible to use the lesser/greater signs < and >, e.g.:

Constraints on meta-attributes may be combined with the operators &, | and !, e.g.:

Former demos of the National Corpus of Polish and editions of the IPI PAN Corpus use different metadata schemes. Please refer to The IPI PAN Corpus Cheatsheet for details.

3.8 Aligning matches

In order to make the results of a query more readable, it is possible to place within the query proper, i.e., before the qualifiers within and meta, a special alignment marker, ^, as in:

Instead of the usual three columns containing the left context of the match, the match itself, and the right context, the results of this query will be split into four columns, containing, respectively, the left context, the left match, i.e., the sequence of segments matching the part of the query before the alignment marker ^ (here, a non-empty sequence of nominative adjectives), the right match (here, a non-empty sequence of nominative nouns), and the right context.