Community for collecting linguistic data - Meta

Community Proposal

निरंजन

Hello all,

I am a linguist, and we need language data for our research. By data I mean sentences, words, audio clips of real speakers, [glossed sentences](https://en.wikipedia.org/wiki/Interlinear_gloss) and so on. Usually linguists ask their friends for data from various languages, but beyond a point it becomes very difficult to collect diverse data this way. It is also very repetitive to keep collecting data for the same phenomena again and again. For example, suppose I want the word for "bag" in Vietnamese. Somebody may already have collected it at some point, but for lack of "free and open resources" I still need to find a Vietnamese speaker to complete my research, which is not always easy.

As TopAnswers uses CC licenses, it would be amazingly helpful for linguists to collect and browse this data on a free and open online platform.

# What does this site need?

* A way to upload audio clips. Very essential!
* A way to attach video clips. (Especially for phonetic research, linguists might need to "see" the movement of the articulators.)
* Just as programming forums have special syntax highlighting, a linguistic data forum would need a special syntax for showing [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) symbols. For example, \`hɛlo wəːld\` should show `hɛlo wəːld` in a good IPA font like Doulos SIL or Charis SIL.
* Some programming for interlinear glosses. Interlinear glosses need a table-like layout, which can be seen in the following screenshot:

![Screenshot-77.png](/image?hash=3e12bc52de2cde15e6c73355768f57f980acc4b71e2b46c4923c055f1b37a197)

The first line gives the natural pronunciation transcribed in a phonetic script. The second line uses glosses to describe the words from the first line; as you can clearly see, glosses are almost always longer than the actual words, so some extra space is left in the first line. The third line is the natural translation, which can be printed with simple formatting. One can imagine this as a borderless table with left-aligned cells, where the third line is not part of the table. I would like our normal code blocks to perform this task.

# Proposed input

```
# he pen̪ ɾam=ʦ-ə ahe
# $DEM$.$PROX$ pen.$N$ Ram=$GEN$-$AGR$ $AUX$
This pen is Ram's.
```

I have put dollar signs around the glosses so that they are printed in small caps. Lines that start with # become part of the table-like gloss layout. Other lines are printed normally. The above code should replicate the screenshot.
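The rule described here (lines starting with # form a borderless, left-aligned table, and `$...$` marks a gloss) can be sketched in Python. This is only a minimal plain-text illustration, not how TopAnswers renders code blocks. The function names are made up for the example, upper case stands in for small caps, and the two-space column gap is an assumption.

```python
import re
import unicodedata

GLOSS_MARK = re.compile(r"\$([^$]+)\$")  # matches $GEN$, $AUX$, ...


def visible_len(text):
    """Length ignoring combining marks, so a base letter plus diacritic counts once."""
    return sum(1 for ch in text if not unicodedata.combining(ch))


def strip_marks(word):
    """Remove the dollar signs; upper case is a stand-in for small caps."""
    return GLOSS_MARK.sub(lambda m: m.group(1).upper(), word)


def align_rows(rows):
    """Pad each column to its widest cell: left-aligned, no borders."""
    n_cols = max(len(row) for row in rows)
    widths = [0] * n_cols
    for row in rows:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], visible_len(cell))
    lines = []
    for row in rows:
        cells = [cell + " " * (widths[i] - visible_len(cell))
                 for i, cell in enumerate(row)]
        lines.append("  ".join(cells).rstrip())
    return lines


def render_gloss(block):
    """Align consecutive '#' lines as one table; pass other lines through."""
    out, pending = [], []
    for line in block.splitlines():
        if line.startswith("#"):
            pending.append([strip_marks(w) for w in line[1:].split()])
        else:
            if pending:
                out.extend(align_rows(pending))
                pending = []
            out.append(line)
    if pending:
        out.extend(align_rows(pending))
    return "\n".join(out)
```

Calling `render_gloss` on the proposed input above would pad the shorter transcribed words so that each one sits over its gloss, and leave the translation line untouched. An actual implementation would presumably emit HTML with a small-caps style rather than padded text.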

At first sight this seems to be enough for a linguistic data forum. People might want to add to this. This forum would be fun for general users and a boon to linguists :smile:.

The forum would work like this: anybody asks for words or sentences corresponding to some English expression 'x' (English serving as a common meta-language), and anybody who wants to answer just adds a voice note of 'x' in their language or, if they know IPA, a transcription of 'x'. Questions can be focused (English to Vietnamese) or open-ended (English to any language). It would be a rich resource of linguistic data.

# Sample question

What is the word for `bag` in your language?

# Sample answer

Hello, in Marathi we call it `piʃwi` which is written in Devanagari script as `पिशवी`.

In a sentence -

```
# मला पिशवी दे
# məla piʃwi d̪e
# $1$-$OBL$=$DAT$ bag give.$IMP$
Give me a bag.
```

Top Answer

claybrick

Here are some thoughts/ideas in a hopefully easier-to-follow form:

1. If the focus of the community is going to be on collecting data, it would make sense to gather the data in one place and in a more searchable form, possibly in a (semi)automatic way. If there isn't a good established way to do so, the table format and the question types we choose should give a good hint of how this should be structured.

2. We would need different question types or tags (for example, looking for a similar meaning or a similar construct). I guess these groups would roughly correspond to well-established linguistic categories (lexicon, morphology, syntax, phraseology, ...).

3. Some wiki posts could contain a list of other relevant online resources (categorized by topic, language, etc., but probably also by license for data sources).

4. It would be helpful to mark each answer by its type of source, in particular to distinguish written-only sources from primary answers by native speakers (the good part is that for many questions being a native speaker is all the expertise you need to give a useful answer).

5. It was suggested to add Unicode-based spelling for specific languages, which is surely a good idea. For some languages this can be more difficult, but we could figure that out as they show up. Sticking to a standard form for a specific language would help avoid confusion. Historical written sources can be particularly challenging and may deserve some special treatment, even plain images.

Recently I got curious about something, which actually ended up being useful for illustrating some additional points. I was wondering if it's common to use swear words as a form of negation or as a substitute for "nothing" (or "anything", depending on how a language handles double negation).

6. Swear words can be very interesting in linguistics. Figuring out when their use is appropriate shouldn't really be a problem, especially with the special syntax. I still don't know how well this will interact with some safe-browsing tools, page rankings and similar, so I'm refraining from giving examples, at least here on meta. ~~Sometimes Latin is or was used for the meaning of swear words, but I don't think this solution would be well suited or needed here~~ (using a polite form with the same literal meaning as the swear word and an indication such as vulgar or colloquial seems a better and well-established way to deal with this). Naturally, questions asking just how to swear in language X wouldn't be welcome, but that's probably true for any question for which a simple dictionary would be a better tool.

7. It's mostly a curiosity. I find this kind of question interesting, but it could dilute the already wide scope of the community too much. A dedicated question type would help, but allowing this kind of question could still be distracting if there is a consistent user base that finds the community useful strictly for research purposes and gathering data.

8. A link to a relevant study or external resource would make a good answer to this question. Answers of this kind would probably also be useful for data request questions, as long as they don't distract from what the asker actually needs; they should be somehow distinguishable or left in the comment section.

9. If answered with examples, it would be difficult to group the examples on the basis of an English meta-sentence or some formal expression; a verbose explanation seems better suited. The same could apply to phenomena that simply don't occur in English.

10. Examples in unrelated languages can be more significant. Since I observed the phenomenon in Italian, in Slovenian (where it could be partially a calque) and probably in English (not sure to what extent), and was wondering whether it's a local or global phenomenon, answers from unrelated languages would tell me more. Sometimes cultural or genealogical relationships would have a different weight, and the opposite requirement, for related languages, also seems plausible. These requirements can probably just be expressed in the question, and less "desirable" answers would still be valid data, so this probably doesn't deserve special treatment.

11. The difference between the literal and figurative meaning is relevant (glossed sentences don't seem to cover that well, but I'm not familiar with them so I may be wrong), and being a swear word isn't a grammatical category. Some optional or question-type-specific fields could be handy.
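As a rough illustration of points 1 and 4, collected answers could be stored as small records that are searchable by prompt, language and source type. The sketch below is hypothetical throughout: the `Entry` fields, the `Corpus` class and the source-type labels are invented for the example and are not an existing TopAnswers feature.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Entry:
    prompt: str            # English meta-language expression, e.g. "bag"
    language: str          # language of the answer, e.g. "Marathi"
    ipa: str               # IPA transcription given in the answer
    source_type: str       # assumed labels: "native-speaker", "written-source"
    orthography: str = ""  # spelling in the language's own script, if given


class Corpus:
    """In-memory index of entries, keyed by the lower-cased English prompt."""

    def __init__(self):
        self._by_prompt = defaultdict(list)

    def add(self, entry):
        self._by_prompt[entry.prompt.lower()].append(entry)

    def lookup(self, prompt, language=None, source_type=None):
        """Return entries for a prompt, optionally filtered by language or source."""
        hits = self._by_prompt.get(prompt.lower(), [])
        return [
            e for e in hits
            if (language is None or e.language == language)
            and (source_type is None or e.source_type == source_type)
        ]


corpus = Corpus()
# Data taken from the sample answer in the proposal; the source label is assumed.
corpus.add(Entry("bag", "Marathi", "piʃwi", "native-speaker", "पिशवी"))
marathi_hits = corpus.lookup("bag", language="Marathi")
```

Optional fields such as a literal versus figurative meaning (point 11) could be added to the record in the same way.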

Most of these problems/suggestions can be better addressed as the community grows, but having them written down won't hurt.