(dictionary_data)= # Dictionary data Dictionary applications require dictionary data. (importjson-spec)= ## importjson morphodict requires dictionary data to be supplied in a custom format called `importjson`. It looks like this: [ ⋮, { "analysis": [[], "nîmiw", ["+V", "+AI", "+Ind", "+3Sg"]], "head": "nîmiw", "linguistInfo": { "stem": "nîmi-" }, "paradigm": "VAI", "senses": [{ "definition": "s/he dances", "sources": ["CW"] }], "slug": "nîmiw" }, { "analysis": [[], "nîmiw", ["+V", "+AI", "+Ind", "+X"]], "formOf": "nîmiw", "head": "nîminâniwan", "senses": [ { "definition": "it is a dance, a time of dancing", "sources": ["CW"] } ] }, ⋮ ] This format is a subset of JSON that is intended to match the internal data structures and terminology of morphodict code. It is not recommended as a long-term storage format. It does not have room for everything a linguist might want in a dictionary, only for the things that morphodict currently supports. And it is unlikely to use your preferred linguistic terminology. In practice you will want to store your canonical dictionary data in some other format, such as [Daffodil], and write a short script to convert that to importjson format. [Daffodil]: https://format.digitallinguistics.io **If you do not yet have all of the data available**—e.g., you may have a draft dictionary that does not have paradigms or analyses assigned to entries—it is still *hugely* beneficial if you can provide initial dictionary data containing: - *All the definitions* you have available - At least one FST-analyzable entry *for every paradigm* That is, when getting started, it’s better to have a full dictionary that needs revision and is sparse on detail but provides full details for at least one entry in every paradigm, than to have full details for more entries but only from one or two paradigms. And it’s much better to have some initial draft dictionary data covering the whole language, allowing people to start working with it, than to have no data at all while waiting for it to be ‘right.’ ### importjson formatting suggestions To make it easier for humans to look at the raw dictionary data and to compare different versions of that data, it is *extremely strongly recommended* that you run ./sssttt-manage sortimportjson FILENAME on any importjson files before committing or sharing them. That automatically implements the following suggestions: - Run the importjson files through [`prettier`][prettier] to format them nicely. If using the command line, you’ll need to add a `--parser=json` argument as prettier does not automatically recognize the `.importjson` file extension. - Sort the entries by `slug`, with any `formOf` entries coming immediately after the corresponding lemma entry, and related `formOf` entries sorted together by `head`. Compare strings using [NFD unicode normalization][NFD] so that accented and non-accented characters sort near each other. - Explicitly sort the *keys* of the emitted JSON objects. Otherwise the JSON object keys can be emitted in random or insertion order, creating unnecessary noise in diff output. In JS you can try the [json-stable-sort] package, and in Python the `json.dump` functions take an optional `sort_keys=True` argument. - Make sure that the JSON you are emitting has unicode strings instead of asciified strings with human-unfriendly escape sequences. For example, in python, the `json.dump()` function needs an extra `ensure_ascii=False` keyword argument or you will get `"w\\u00e2pam\\u00eaw"` instead of `"wâpamêw"`. [prettier]: https://www.npmjs.com/package/prettier [NFD]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize#canonical_equivalence_normalization [json-stable-sort]: https://www.npmjs.com/package/json-stable-stringify ### importjson specification An importjson file is a JSON file containing an array of entries. There are two kinds of entries, normal entries and formOf entries. #### Normal entries The fields are: - The `head` field, a required string, is the head for the entry, in the internal orthography for the language. This is what is literally displayed in as the headword in the entry. It is usually a single wordform, but also could be a multiword phrase, non-independent morpheme, stem, &c. - The `senses` field, a required array of `{definition: string, sources: string[], coreDefinition?: string, semanticDefinition?: string}` objects, contains the definitions for the entry. Only the `definition` and `sources` fields are required. The `definition` must currently be an unformatted string. We are aware that people would like to specify things such as source-language text to be shown in the current orthography, cross-references, notes, examples, and so on, in a more structured manner. If we were starting from scratch we might call the `definition` field a `displayDefinition`, but we already have some data, and often it is the only one definition field provided. The `sources` are typically short abbreviations for the name of a source. `sources` is an array because multiple distinct sources may give the same, or essentially the same, definition for a word. The optional `coreDefinition` field may specify a definition to use for auto-translation. For example, if an entry for ‘shoe’ has lots of details and notes, but when auto-translated into first person possessive it should simply become ‘my shoe’, you can specify the core definition as `shoe`. The optional `semanticDefinition` field may specify a definition to use instead of the main definition text for the purposes of search. It will be used instead of the plain `definition` field for indexing keywords, and when computing definition vectors for semantic search. This is related to the concept of the core definition, but may add additional relevant keywords, while leaving out stopwords or explanatory text such as ‘see’ in ‘[see other-word]’ or the literal word ‘literally.’ - The `slug` field, a required string, is a unique key for this entry that is used for several purposes, including user-facing URLs, to make `formOf` references, and for homograph disambiguation. Each normal entry must provide a `slug` that is distinct from every other normal entry. It is recommended that this field be the same as the `head`, including diacritics, but with any URL-unsafe characters stripped, and a homograph disambiguator added at the end if needed for uniqueness. How to create disambiguators is up to the lexicographer. For example, for the Plains Cree homograph `yôtin`, the lexicographer might set the slugs for the three entries to `yôtin@1` and `yôtin@2` and `yôtin@3`; or to `yôtin@na` and `yôtin@ni` and `yôtin@v`; or following any other scheme they desire, so long as the assignments remain relatively stable. Any homograph disambiguator should start with `@`, as there is code in morphodict to redirect an attempt to access an invalid homograph disambiguator like `/word/nîmiw@foo` into a search for `nîmiw`, whereas something like `nîmiw-foo` will do a search for `nîmiw-foo` instead of `nîmiw`. That way if disambiguators change due to adding and removing entries/definitions, old links should still be useful to people. - The `paradigm` field, an optional string, is the name of the paradigm used to display the paradigm tables for the entry. This may be a static or dynamic paradigm. This field may have a null value, or be omitted entirely, if the entry is indeclinable. - The `analysis` field, an optional array, is the analysis of the headword. This field is used to populate the linguistic breakdown popup shown by the blue ℹ️ icon. The required format is that of an entry from the list returned by `lookup_lemma_with_affixes`, e.g., `[[], "nîmiw", ["+V", "+AI", "+Ind", "+3Sg"]]`. The format is: - Array of: - Array of FST tag strings - FST lemma string - Array of FST tag strings This field may have a null value, or be omitted entirely, if the entry is unanalyzable. - The `linguistInfo` field, an optional arbitrary JSON object, allows extra presentation data to be stored in the database. Morphodict HTML templates have access to this data for the purpose of displaying additional data to the end user. Morphodict does not use any of this data for its core language-independent functionality. It is recommended *not* to put any unused data in here that ‘might be handy later,’ but only to add new things here when required as part of a coordinated effort with the frontend code to add new user-facing features. - The `fstLemma` field, an optional string, is the FST lemma to use when generating dynamic paradigm tables for unanalyzable forms in dictionaries that support that. It must not be specified when there is also an `analysis` field. To be clear on the concept that we’re talking about: the FST lemma is the thing that gets plugged into a paradigm layout template to generate associated wordforms. For example, if the head is `nîminâniwan` with FST analysis `nîmiw+V+AI+Ind+X`, then the FST lemma is `nîmiw`. If the head is `nimîw` with the FST analysis `nimîw+V+AI+Ind+3Sg`, then the FST lemma is `nimîw`, which is the same as the head. Normally the FST lemma is included as part of the `analysis`, and the code can retrieve the conceptual FST lemma from that `analysis`, so the separate `fstLemma` field is redundant and should not be explicitly included. However, sometimes it is desirable to have a dictionary entry that is not analyzable but for which dynamic paradigms should be displayed. For example, in Arapaho, one entry has the non-analyzable stem `níhooyóó-` as a head, which should display dynamic paradigms using the FST lemma `nihooyoo`. Therefore, that is precisely what the `fstLemma` field for this entry contains. In that case of dynamic paradigms for non-analyzable `head` entries, and only in that case, is this field useful. This field is only supported for languages with the `MORPHODICT_ENABLE_FST_LEMMA_SUPPORT` setting enabled, which is currently only Arapaho. Note that the only strictly required fields are `head`, `slug`, and `senses`. If no other fields are supplied, morphodict will still work, but many interesting and useful features of morphodict will not; you will essentially have a static dictionary application. #### formOf entries These entries add additional definitions to inflected forms of normal entries. In the morphodict application, formOf entries are described as being a ‘form of’ the corresponding normal entry. Some linguists feel that, lexicographically, these entries may be more appropriate as standalone normal entries, on the grounds that having a distinct definition implies being a distinct lexeme. Others argue that there is room for certain inflected forms to have their own connotations and shades of meaning, even in English, but especially in morphologically complex languages. The fields are: - The `formOf` field, a string, must equal the `slug` of a normal entry in the same importjson file. - The `head` field has the same format and meaning as for normal entries. - The `senses` field has the same format and meaning as for normal entries. - The `analysis` field has the same format and meaning as for normal entries, but with the additional condition that the formOf `analysis` FST lemma must equal the FST lemma of the corresponding normal entry. No other fields are valid on formOf entries. #### Validation The following importjson validation checks are intended to be implemented for morphodict: - Every entry must have at least one non-empty definition, and every definition must have at least one valid source. - If an `analysis` is specified, it must be one of the results returned from doing an FST lookup on the `head` - If an `analysis` is specified, the `head` must be one of the results returned by running the `analysis` through the generator FST - `fstLemma` may not be specified if there is an `analysis` - The FST lemma in every `formOf` analysis must match the FST lemma of the corresponding normal entry - Strings must not begin with a combining character. If a string is intended to start with a diacritic, e.g., a floating tone such as `"´a"`, or `" ̣gwà…"`, use a non-combining character such as `´`, or if there is no non-combining equivalent such as for Combining Dot Below, put the combining character on a space, a non-breaking space, or a U+25CC ◌ Dotted Circle. - The `slug` must not contain certain URL-unsafe characters, e.g., `/` #### Caveats Known issues with the importjson format: - In many dictionaries, the order of definitions is very important, with the most common definitions being listed first. There is currently no code in morphodict to explicitly store or preserve the order of definitions. The dictionary is currently largely working by coincidence in that the import process and database tend to show the same results to the user as what was in the importjson file. This isn’t so much an issue with the input format, which does have an explicit definition order, as a warning that this order may not be preserved. (where_dictionary_files_go)= ## Where do files go? Each full dictionary for language pair [`sssttt`](sssttt) is intended to be placed at src/${sssttt}/resources/dictionaries/${sssttt}_dictionary.importjson There is a `.gitignore` rule to prevent accidentally committing them. (building_test_dictionary_data)= ### Building test dictionary data Test dictionaries are created by taking subsets of the full dictionaries, and storing them beside the full dictionaries as `${sssttt}_test_db.importjson`. *These files are checked in so that people can do development and testing without having any access to the full dictionary files which have restricted distribution.* To update them, you’ll need a copy of the full dictionary file. 1. Edit `src/${sssttt}/resources/dictionary/test_db_words.txt` 2. Run `./${sssttt}-manage buildtestimportjson` to extract the words mentioned in `test_db_words.txt`, from `${sssttt}_dictionary.importjson` into `${sssttt}_test_db.importjson` 3. Commit your changes and send a PR. *Exception: the current crkeng test database omits many unused keys, e.g., `wordclass_emoji`, that currently exist in the production `crkeng_dictionary.importjson` file.* (current_dictionary_data)= ## Current dictionary data In theory, linguists will provide comprehensive and correct dictionary data in the morphodict-specific importjson format. In practice, at this time, full dictionaries for each language arise as follows: - For Plains Cree, there is `crk/dicts/crkeng_dictionary.importjson` checked in to the secret git repository at `altlab.dev:/data/altlab.git`. This is what’s used in production. It was created by importing the old `crkeng.xml` file into an older version of the software that did a lot of paradigm and analysis inference during import, and then the database contents were exported in the new importjson format. - For Woods Cree, the `UAlbertaALTLab/munge:/cwdeng/cwdize` script transliterates the production `crkeng_dictionary.importjson` file, using the `database.ndjson` file from the ALTLab repo to get the proto-Cree forms. - For Arapaho, the [private arp-db repo](https://github.com/UAlbertaALTLab/arp-db) has `arapaho_lexicon.json`, and the `UAlbertaALTLab/munge:/arpeng/toimportjson.ts` script transforms that to importjson. - For Tsuut’ina, there’s a deprecated spreadsheet on Google Drive. Get the link from `srs/README.md` in the ALTLab repo, as I’m not sure it’s intended to be public. Download the single-tab spreadsheet as a .tsv file, and `UAlbertaALTLab/munge:/srseng/toimportjson.ts` will transform that to importjson. None of these are publicly available as the creators of the source content have not given permission to make them publicly available in that form. For Arapaho we believe we could make the data public, but have not yet had sufficiently official confirmation of that. ### Building To install the prerequisites of the munge scripts: git clone git@github.com:UAlbertaALTLab/munge.git cd munge npm install Then, running them takes a little bit of fiddling because they are written in TypeScript and need to be transpiled to JavaScript. To do that on the fly: node -r sucrase/register/ts cwdeng/cwdize.ts --help In several of the directories there is an executable `run.js` script to do that for you, so it could be as simple as `./run.js --help`. ## Cree `linguistInfo` For the Plains Cree dictionary, the following `linguistInfo` fields are used to display linguistic info in search results, to provide semantic class information, and for showing emoji: - `inflectional_category`, String: The inflectional category for an entry, with hyphen, e.g., `NI-1`. (CW's `\ps`) - `pos`, String: The part of speech for this entry (`N` / `V` / `PRON`). If we were naming this today following our glossary, we would call it the *general word class*. - `rw_domains`, list of String: The RapidWords semantic classification domain names for this entry, in the canonical form defined in both rapidwords.net and semdom.org: e.g. `[ "Sleep" ]` - `rw_indices`, dictionary mapping String to a list of String: For each of the sources in the entry (using the same short abbreviations as in `source`), we provide the list of indices for the RapidWords semantic classification domains for the entry, in the canonical form defined in both rapidwords.net and semdom.org: e.g.: ``` { "CW": [ "5.7.1" ] } ``` - `stem`, String: The FST stem for this entry. For Plains Cree specifically, there are two variants of linguistic stems in the ALTLab crk-db. For both, a preceding hyphen (for dependent nouns, e.g. *-ohkom-*) and/or following hyphen (for all stems, e.g. *nimî-*) indicate that they can take additional prefixes/suffixes: - the minimal CW stem from `\stm` field in the CW toolbox source, which N.B. is lacking from MD and AECD. It should be present there for all words, but blank for non-independent morphemes, and might be a list when the head is a phrase. This minimal stem may lack lexicalized reduplicative elements and/or preverbs/prenouns, and thus may not have a one-to-one mapping to possible lemmas (e.g. *api-* as the minimal CW stem of the lemma *ay-apiw*. - the full FST stem according to the `fst.stem` field in the ALTLab crk-db. This includes all the reduplicative elements as well as preverbs/prenouns which have become lexicalized in a lemma, and thus has a one-to-one mapping with the lemma. This is created in the ALTLab crk-db based on the minimal stem, e.g. *ay-api-* as the full FST stem for the lemma *ay-apiw*. The FST stem, when supplemented with special morphophonological symbols, is used in the lexc source code for crk stems in the format: <`FST lemma`>:<`FST stem`>, e.g., [`acitakotêw:acitakot3 VTAt ;`][fst-stem1] itwêwina currently has the FST stem in the `linguistInfo.stem` field, and does not include a separate CW stem in the importjson. If display of the minimal CW stem were some day added to morphodict, that would of course require the dictionary data to include that data at that time. - `wn_domains`, a list of String: The WordNet semantic classifications for this entry, using the same format as in the Altlab wordnet server, e.g., `[ "(v) sleep#1", "(adv) together#4" ]`. - `wordclass`, String: The word class for this entry (`VTA` / `VAI` / etc.). At one time our glossary called this a *specific word class*. [FST-stem1]: https://github.com/giellalt/lang-crk/blob/8574d2b163d115e6da4419794f21ffe692d76b9b/src/fst/stems/verb_stems.lexc#L123