From 2b5d6dc08079aa2f2ec7b55eb5fc2998fb34be7c Mon Sep 17 00:00:00 2001
From: A1exRey
Date: Tue, 30 Oct 2018 13:00:35 +0300
Subject: [PATCH 01/16] Create quiz-1-response.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Added the quiz answer
---
 .../quizzes/quiz-1/quiz-1-response.md         | 51 +++++++++++++++++++
 1 file changed, 51 insertions(+)
 create mode 100644 2018-komp-ling/quizzes/quiz-1/quiz-1-response.md

diff --git a/2018-komp-ling/quizzes/quiz-1/quiz-1-response.md b/2018-komp-ling/quizzes/quiz-1/quiz-1-response.md
new file mode 100644
index 00000000..714bc52c
--- /dev/null
+++ b/2018-komp-ling/quizzes/quiz-1/quiz-1-response.md
@@ -0,0 +1,51 @@
# Quiz 1

1. Which problems does maxmatch suffer from? (Choose all that
   apply.)

   a) requires a comprehensive dictionary

   d) constructs non-grammatical sentences

2. Write a perl/sed substitution with regular expressions that
   adds whitespace for segmentation around "/" in "either/or"
   expressions but not around fractions "1/2":

   Answer:

       sed 's|\([[:alpha:]]\)/\([[:alpha:]]\)|\1 / \2|g'

3. The text mentions several times that machine-learning
   techniques produce better segmentation than rule-based
   systems; what are some downsides of machine-learning
   techniques compared to rule-based ones?

   Answer:
   1) risk of overfitting the training data
   2) a trained model is language-specific and has to be retrained for every new language
   3) lack of interpretability

4. Write a sentence (in English or in Russian) which maxmatch
   segments incorrectly.

   Answer:

       При правовых вопросах

       Приправовыхвопросах

       Приправ о вы х вопросах

5. What are problems for sentence segmentation? Provide one
   example in English or Russian for each that applies.

   a) ambiguous abbreviations with punctuation ("Mr. Smith arrived.")

   c) sentences lacking separating punctuation (e.g. chat or headline text with no final punctuation: "see you tomorrow same place")
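Answers 1 and 4 can be made concrete with a toy maxmatch implementation. This is a sketch; the dictionary below is an illustrative stub (a real system needs a comprehensive one, which is exactly the weakness named in answer 1a):

```python
# Toy maxmatch: greedily take the longest dictionary word at each
# position; characters with no dictionary match become one-letter tokens.
DICT = {'приправ', 'при', 'правовых', 'вопросах', 'вопрос'}

def maxmatch(text, dictionary=DICT):
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # longest candidate first
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

print(maxmatch('приправовыхвопросах'))
# → ['приправ', 'о', 'в', 'ы', 'х', 'вопросах']
```

The greedy longest match swallows "приправ" ("seasonings") and then emits single-letter debris, reproducing the non-grammatical segmentation from answer 4.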
From fb67124ce613e9c769745371a14f91f4ce0f0578 Mon Sep 17 00:00:00 2001
From: A1exRey
Date: Tue, 30 Oct 2018 13:40:37 +0300
Subject: [PATCH 02/16] added segmentation report
---
 .../segmentation/segmentation-response.md     | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
 create mode 100644 2018-komp-ling/practicals/segmentation/segmentation-response.md

diff --git a/2018-komp-ling/practicals/segmentation/segmentation-response.md b/2018-komp-ling/practicals/segmentation/segmentation-response.md
new file mode 100644
index 00000000..fa2dc210
--- /dev/null
+++ b/2018-komp-ling/practicals/segmentation/segmentation-response.md
@@ -0,0 +1,19 @@
+ +

# Overview of two sentence-tokenization libraries

This report tries out two sentence-tokenization libraries: pragmatic segmenter (Ruby) and NLTK (Python). A slice of a Russian Wikipedia dump was used as the test text.

## Pragmatic segmenter (Ruby)

Pragmatic segmenter is a rule-based library for Ruby. On the Russian Wikipedia sample its quality was below average: most abbreviations, name initials, and the like were wrongly split into separate sentences. On the whole, the library is geared toward languages written in the Latin alphabet.

## NLTK (Python)

sent_tokenize() is the NLTK function for sentence-boundary detection. Under the hood it is an unsupervised machine-learning algorithm that can also be trained on one's own data. NLTK ships with a set of pre-trained models, including one for Russian. Overall this library performed better than the Ruby one: most abbreviations and initials are segmented correctly; the only remaining problem is abbreviations with spaces inside them.
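The failure mode described for rule-based splitting can be illustrated with a toy splitter. The naive rule below is mine, not taken from either library:

```python
import re

def naive_split(text):
    # Naive rule: a sentence ends at . ! or ? followed by whitespace
    # and a capital letter -- exactly the rule that initials break.
    return re.split(r'(?<=[.!?])\s+(?=[А-ЯA-Z])', text)

print(naive_split('Рассказы А. П. Чехова переиздали. Тираж вырос.'))
# The initials "А." and "П." are wrongly treated as sentence ends,
# so we get four pieces instead of two:
# → ['Рассказы А.', 'П.', 'Чехова переиздали.', 'Тираж вырос.']
```

A trained model such as NLTK's learns that single capital letters followed by a period are usually abbreviations, which is why it handled the Wikipedia sample better.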
From 5f7229aace713700997c36d77e010b29659f8c01 Mon Sep 17 00:00:00 2001 From: A1exRey Date: Mon, 12 Nov 2018 21:39:13 +0300 Subject: [PATCH 03/16] =?UTF-8?q?=D0=94=D0=BE=D0=B1=D0=B0=D0=B2=D0=BB?= =?UTF-8?q?=D0=B5=D0=BD=D0=BE=20=D0=BE=D0=BF=D0=B8=D1=81=D0=B0=D0=BD=D0=B8?= =?UTF-8?q?=D0=B5=20=D0=B4=D0=BB=D1=8F=20=D1=82=D1=80=D0=B0=D0=BD=D1=81?= =?UTF-8?q?=D0=BB=D0=B8=D1=82=D0=B5=D1=80=D0=B0=D1=86=D0=B8=D0=B8?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../practicals/transliteration-response.md | 42 +++++++++++++++++++ 1 file changed, 42 insertions(+) create mode 100644 2018-komp-ling/practicals/transliteration-response.md diff --git a/2018-komp-ling/practicals/transliteration-response.md b/2018-komp-ling/practicals/transliteration-response.md new file mode 100644 index 00000000..6532626d --- /dev/null +++ b/2018-komp-ling/practicals/transliteration-response.md @@ -0,0 +1,42 @@ +# Practical 2: Transliteration (engineering) + +
## Questions

What to do with ambiguous letters? For example, Cyrillic `е` could be either je or e.

Can you think of a way that you could provide mappings from many characters to one character?
For example sh → ш or дж → c?

How might you make different mapping rules for characters at the beginning or end of the string?

### Transliteration rules

The main idea is to start the transliteration with the complex, multi-letter mappings (ч - tch). For example:
>Шарик -- sh-арик -- sharik

Next, replace all vowels at the beginning and at the end of a word (Я - ya):
>яблоко -- ya-блоко -- yabloko

After that, the simple one-letter mappings can be applied (у - u):
>мед -- med

## Methods

### Encoding and decoding with KOI8-R

Transliterating via the KOI8-R encoding is not the most accurate method, but it has some properties that the other approaches lack:

1) the original text can be recovered;

2) the encoding rules are already defined.

The KOI8-R transliteration method is implemented in the file transliterate_koi8r.py.

### Encoding and decoding with rules

The transliteration rules live in the file rules.txt. It defines rules for consonants, for vowels, and for vowels at the beginning of a word.
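The "multi-letter rules first" idea above can be sketched as a greedy longest-match scan. The rule subset below, including дж → j, is illustrative only and is not the project's full rules.txt:

```python
# Tiny illustrative rule subset; 'дж' -> 'j' is an assumed multi-letter rule.
RULES = {'дж': 'j', 'ш': 'sh', 'ч': 'tch', 'а': 'a', 'е': 'e', 'и': 'i',
         'к': 'k', 'м': 'm', 'р': 'r', 'д': 'd'}

def translit(word):
    out, i = [], 0
    while i < len(word):
        # Try the longest rules first so "дж" wins over "д".
        for n in (2, 1):
            chunk = word[i:i + n]
            if len(chunk) == n and chunk in RULES:
                out.append(RULES[chunk])
                i += n
                break
        else:
            out.append(word[i])  # no rule: keep the character as-is
            i += 1
    return ''.join(out)

print(translit('шарик'))  # → sharik
print(translit('джем'))   # → jem
```

Extending this with position-dependent rules (word-initial vowels, as in rules.txt) only requires checking `i == 0` before the lookup.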
From 9bed2a385c85c82a66cee86f8c26e6338482e71d Mon Sep 17 00:00:00 2001 From: A1exRey Date: Mon, 12 Nov 2018 21:41:36 +0300 Subject: [PATCH 04/16] Create transliteration-response.md --- .../transliteration-response.md | 42 +++++++++++++++++++ 1 file changed, 42 insertions(+) create mode 100644 2018-komp-ling/practicals/transliteration/transliteration-response.md diff --git a/2018-komp-ling/practicals/transliteration/transliteration-response.md b/2018-komp-ling/practicals/transliteration/transliteration-response.md new file mode 100644 index 00000000..6532626d --- /dev/null +++ b/2018-komp-ling/practicals/transliteration/transliteration-response.md @@ -0,0 +1,42 @@ +# Practical 2: Transliteration (engineering) + +
## Questions

What to do with ambiguous letters? For example, Cyrillic `е` could be either je or e.

Can you think of a way that you could provide mappings from many characters to one character?
For example sh → ш or дж → c?

How might you make different mapping rules for characters at the beginning or end of the string?

### Transliteration rules

The main idea is to start the transliteration with the complex, multi-letter mappings (ч - tch). For example:
>Шарик -- sh-арик -- sharik

Next, replace all vowels at the beginning and at the end of a word (Я - ya):
>яблоко -- ya-блоко -- yabloko

After that, the simple one-letter mappings can be applied (у - u):
>мед -- med

## Methods

### Encoding and decoding with KOI8-R

Transliterating via the KOI8-R encoding is not the most accurate method, but it has some properties that the other approaches lack:

1) the original text can be recovered;

2) the encoding rules are already defined.

The KOI8-R transliteration method is implemented in the file transliterate_koi8r.py.

### Encoding and decoding with rules

The transliteration rules live in the file rules.txt. It defines rules for consonants, for vowels, and for vowels at the beginning of a word.
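The KOI8-R method mentioned above rests on a design feature of that encoding: clearing the 8th bit of a KOI8-R byte yields the Latin letter conventionally paired with the Cyrillic one (with case flipped). A minimal standalone illustration, not taken from the practical's files:

```python
# KOI8-R places Cyrillic letters so that masking off the high bit
# gives a rough Latin transliteration (lowercase comes out uppercase).
def koi8r_translit(word):
    return ''.join(chr(b & 0x7F) for b in word.encode('koi8-r'))

print(koi8r_translit('привет'))  # → PRIWET
```

Because the mapping is a fixed property of the byte layout, no rule table is needed, which is the second advantage listed above.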
From 0f4e3e17d4e1e80d5f08234306b1e1f01a69252b Mon Sep 17 00:00:00 2001 From: A1exRey Date: Mon, 12 Nov 2018 21:41:54 +0300 Subject: [PATCH 05/16] Delete transliteration-response.md --- .../practicals/transliteration-response.md | 42 ------------------- 1 file changed, 42 deletions(-) delete mode 100644 2018-komp-ling/practicals/transliteration-response.md diff --git a/2018-komp-ling/practicals/transliteration-response.md b/2018-komp-ling/practicals/transliteration-response.md deleted file mode 100644 index 6532626d..00000000 --- a/2018-komp-ling/practicals/transliteration-response.md +++ /dev/null @@ -1,42 +0,0 @@ -# Practical 2: Transliteration (engineering) - -
## Questions

What to do with ambiguous letters? For example, Cyrillic `е` could be either je or e.

Can you think of a way that you could provide mappings from many characters to one character?
For example sh → ш or дж → c?

How might you make different mapping rules for characters at the beginning or end of the string?

### Transliteration rules

The main idea is to start the transliteration with the complex, multi-letter mappings (ч - tch). For example:
>Шарик -- sh-арик -- sharik

Next, replace all vowels at the beginning and at the end of a word (Я - ya):
>яблоко -- ya-блоко -- yabloko

After that, the simple one-letter mappings can be applied (у - u):
>мед -- med

## Methods

### Encoding and decoding with KOI8-R

Transliterating via the KOI8-R encoding is not the most accurate method, but it has some properties that the other approaches lack:

1) the original text can be recovered;

2) the encoding rules are already defined.

The KOI8-R transliteration method is implemented in the file transliterate_koi8r.py.

### Encoding and decoding with rules

The transliteration rules live in the file rules.txt. It defines rules for consonants, for vowels, and for vowels at the beginning of a word.
From 5dd808654a72165fc2b629db4e5ab4bb3f46979f Mon Sep 17 00:00:00 2001
From: A1exRey
Date: Mon, 12 Nov 2018 21:43:51 +0300
Subject: [PATCH 06/16] =?UTF-8?q?=D0=94=D0=BE=D0=BC=D0=B0=D1=88=D0=BD?=
 =?UTF-8?q?=D0=B5=D0=B5=20=D0=B7=D0=B0=D0=B4=D0=B0=D0=BD=D0=B8=D0=B5=202?=
 =?UTF-8?q?=20--=20=D1=82=D1=80=D0=B0=D0=BD=D1=81=D0=BB=D0=B8=D1=82=D0=B5?=
 =?UTF-8?q?=D1=80=D0=B0=D1=86=D0=B8=D1=8F?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
---
 .../practicals/transliteration/rank.py        | 44 ++++++++++++++
 .../practicals/transliteration/rules.txt      | 36 ++++++++++++
 .../transliteration/transliterate.py          | 58 +++++++++++++++++++
 .../transliteration/transliterate_koi8r.py    | 49 ++++++++++++++++
 4 files changed, 187 insertions(+)
 create mode 100644 2018-komp-ling/practicals/transliteration/rank.py
 create mode 100644 2018-komp-ling/practicals/transliteration/rules.txt
 create mode 100644 2018-komp-ling/practicals/transliteration/transliterate.py
 create mode 100644 2018-komp-ling/practicals/transliteration/transliterate_koi8r.py

diff --git a/2018-komp-ling/practicals/transliteration/rank.py b/2018-komp-ling/practicals/transliteration/rank.py
new file mode 100644
index 00000000..476ab50c
--- /dev/null
+++ b/2018-komp-ling/practicals/transliteration/rank.py
@@ -0,0 +1,44 @@
#!/usr/bin/python

import sys, getopt

def main(argv):
    inputfile = ''
    outputfile = 'ranked.txt'
    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
    except getopt.GetoptError:
        print('rank.py -i <inputfile> -o <outputfile>')
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print('rank.py -i <inputfile> -o <outputfile>')
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-o", "--ofile"):
            outputfile = arg
    print('Input file is:', inputfile)
    print('Output file is:', outputfile)

    freq = []
    with open(inputfile, 'r', encoding='utf8') as fd:
        for line in fd.readlines():
            line = line.strip('\n')
            (f, w) = line.split('\t')
            freq.append((int(f), w))

    # ensure descending frequency order before ranking
    freq.sort(reverse=True)

    rank = 1
    min_f = freq[0][0]
    ranks = []
    for i in range(0, len(freq)):
        if freq[i][0] < min_f:
            rank = rank + 1
            min_f = freq[i][0]
        ranks.append((rank, freq[i][0], freq[i][1]))

    # write one "rank<TAB>frequency<TAB>word" line per entry
    with open(outputfile, 'w+', encoding='utf8') as fd:
        for (r, f, w) in ranks:
            fd.write('%d\t%d\t%s\n' % (r, f, w))

if __name__ == "__main__":
    main(sys.argv[1:])
diff --git a/2018-komp-ling/practicals/transliteration/rules.txt b/2018-komp-ling/practicals/transliteration/rules.txt
new file mode 100644
index 00000000..a3bdc243
--- /dev/null
+++ b/2018-komp-ling/practicals/transliteration/rules.txt
@@ -0,0 +1,36 @@
 я ya
 ю yu
 е ye
а a
б b
в v
г g
д d
е e
ё yo
ж zsh
з z
и i
й y
к k
л l
м m
н n
о o
п p
р r
с s
т t
у u
ф f
х h
ц ts
ч tch
ш ch
щ scsh
ъ '
ы uy
ь '
э a
ю u
я a
diff --git a/2018-komp-ling/practicals/transliteration/transliterate.py b/2018-komp-ling/practicals/transliteration/transliterate.py
new file mode 100644
index 00000000..73974730
--- /dev/null
+++ b/2018-komp-ling/practicals/transliteration/transliterate.py
@@ -0,0 +1,58 @@
#!/usr/bin/python

import sys, getopt

def main(argv):
    inputfile = ''
    outputfile = '__transliterated.conllu'
    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
    except getopt.GetoptError:
        print('transliterate.py -i <inputfile> -o <outputfile>')
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print('transliterate.py -i <inputfile> -o <outputfile>')
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-o", "--ofile"):
            outputfile = arg
    print('Input file is:', inputfile)
    print('Output file is:', outputfile)

    vocab = []
    with open(inputfile, 'r', encoding='utf8') as test:
        for line in test.readlines():
            if '\t' not in line:
                continue
            row = line.replace('\n', '').split('\t')
            if len(row) != 10:
                continue
            vocab.append(row)

    # Load the rules; a leading space in rules.txt marks a mapping
    # that applies only at the beginning of a word.
    rules_for = {}
    with open('rules.txt', 'r', encoding='utf8') as fd:
        for line in fd:
            line = line.rstrip('\n')
            if not line.strip():
                continue
            src, dst = line.rsplit(' ', 1)
            rules_for[src] = dst

    # Transliterate the FORM column and store the result in MISC.
    for idx, row in enumerate(vocab):
        out = ''
        for j, ch in enumerate(row[1].lower()):
            if j == 0 and ' ' + ch in rules_for:
                out += rules_for[' ' + ch]  # word-initial rule
            elif ch in rules_for:
                out += rules_for[ch]
            else:
                out += ch
        vocab[idx][9] = out

    with open(outputfile, 'w+', encoding='utf8') as fd:
        for w in vocab:
            fd.write('\t'.join(w) + '\n')

if __name__ == "__main__":
    main(sys.argv[1:])
diff --git a/2018-komp-ling/practicals/transliteration/transliterate_koi8r.py b/2018-komp-ling/practicals/transliteration/transliterate_koi8r.py
new file mode 100644
index 00000000..fe6799f1
--- /dev/null
+++ b/2018-komp-ling/practicals/transliteration/transliterate_koi8r.py
@@ -0,0 +1,49 @@
#!/usr/bin/python

import sys, getopt

def main(argv):
    inputfile = ''
    outputfile = '__transliterated.conllu'
    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
    except getopt.GetoptError:
        print('transliterate_koi8r.py -i <inputfile> -o <outputfile>')
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print('transliterate_koi8r.py -i <inputfile> -o <outputfile>')
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-o", "--ofile"):
            outputfile = arg
    print('Input file is:', inputfile)
    print('Output file is:', outputfile)

    vocab = []
    with open(inputfile, 'r', encoding='utf8') as test:
        for line in test.readlines():
            if '\t' not in line:
                continue
            row = line.replace('\n', '').split('\t')
            if len(row) != 10:
                continue
            vocab.append(row)

    # The KOI8-R trick: masking off the high bit of a KOI8-R byte
    # yields a rough Latin transliteration of the Cyrillic letter.
    for idx, row in enumerate(vocab):
        try:
            oldone = row[1].encode('koi8-r')
            vocab[idx][9] = ''.join([chr(c & 0x7F) for c in oldone])
        except UnicodeEncodeError:
            continue

    with open(outputfile, 'w+', encoding='utf8') as fd:
        for w in vocab:
            fd.write('\t'.join(w) + '\n')

if __name__ == "__main__":
    main(sys.argv[1:])
From 85cd857f296120bb0ec454c2e8bdfb8f5fd2023d Mon Sep 17 00:00:00 2001
From: A1exRey
Date: Mon, 10 Dec 2018 21:48:19 +0300
Subject: [PATCH 07/16] =?UTF-8?q?=D0=94=D0=BE=D0=B1=D0=B0=D0=B2=D0=BB?=
 =?UTF-8?q?=D0=B5=D0=BD=D1=8B=20=D0=BE=D1=82=D0=B2=D0=B5=D1=82=D1=8B=20?=
=?UTF-8?q?=D0=BD=D0=B0=20=D1=82=D0=B5=D1=81=D1=823?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../quizzes/quiz3/quiz-3-response.md | 65 +++++++++++++++++++ 1 file changed, 65 insertions(+) create mode 100644 2018-komp-ling/quizzes/quiz3/quiz-3-response.md diff --git a/2018-komp-ling/quizzes/quiz3/quiz-3-response.md b/2018-komp-ling/quizzes/quiz3/quiz-3-response.md new file mode 100644 index 00000000..916a0618 --- /dev/null +++ b/2018-komp-ling/quizzes/quiz3/quiz-3-response.md @@ -0,0 +1,65 @@ + + +
# Quiz 3

1. In the reading, it is claimed that to implement a morphological disambiguator for an unseen language, it takes roughly the same amount of time whether annotating a corpus to train on versus writing constraint grammar rules.

   a) Give an argument for why constraint grammar rules are more valuable

   Constraint grammar rules give a high precision score, but the recall score is low most of the time.
   So, if the constraint rules can be implemented simply, we should use constraint grammar.

   b) Give an argument for why corpus annotation and HMM training is more valuable

   Conversely, an HMM gives a high recall score (higher than CG rules),
   but an HMM will rarely reach the precision level of CG rules.

2. Can the two systems be used together? Explain.

   Yes. The basis of the grammar is composed of constraint rules. Yet, when the rules
   cannot provide a solution, there is room for elements that carry
   probabilistic features; this contributes to the robustness of the grammar.
   So we should apply the CG rules first, and the HMM after that.

3. Give a sentence with morphosyntactic ambiguity.
   What would you expect a disambiguator to do in this situation? What can you do?

   'Косой косой косил косой'

   In this case the disambiguator will give us

       [A, A, V, N]

   because it assumes that next to the verb there must be exactly one noun.
   But the real PoS tags are

       [A, N, V, N]

4. Choose several (>2) quantities that evaluate the quality of a morphological disambiguator,
   and describe how to compute them. Describe what it would mean to have disambiguators which
   differ in quality based on which quantity is used.

   To compute the scores (TP, FN, etc.) we aggregate the outcomes over every tag we have.
   Example: take all NOUN tags from the gold standard and from the answers of our model. If:

   • the standard and the answer are both NOUN = TP

   • the standard is NOUN, but the answer is not = FN

   • the answer is NOUN, but the standard is not = FP

   • neither the standard nor the answer is NOUN = TN

   Then we sum these counts over all answers and calculate Precision and Recall.

5. Give an example where an n-gram HMM performs better than a unigram HMM tagger.
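The per-tag computation described in question 4 can be sketched in a few lines (toy data; the tag names follow the quiz's shorthand):

```python
# Per-tag precision and recall from parallel gold/predicted tag sequences.
def prf(gold, pred, tag):
    tp = sum(1 for g, p in zip(gold, pred) if g == tag and p == tag)
    fn = sum(1 for g, p in zip(gold, pred) if g == tag and p != tag)
    fp = sum(1 for g, p in zip(gold, pred) if g != tag and p == tag)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold = ['A', 'N', 'V', 'N']   # 'Косой косой косил косой'
pred = ['A', 'A', 'V', 'N']   # what the disambiguator returned
print(prf(gold, pred, 'N'))   # → (1.0, 0.5)
```

Two disambiguators can then differ in quality depending on the quantity: a CG-style system tends to win on the precision number, an HMM-style system on the recall number, matching the trade-off in questions 1 and 2.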
From f0f2ccf7f9478a66d9f525e8a59b0de42bdbc704 Mon Sep 17 00:00:00 2001
From: A1exRey
Date: Sun, 24 Mar 2019 16:21:01 +0300
Subject: [PATCH 08/16] HW 4 is done
---
 .../practicals/unigra_tagger/unigram.md       | 68 +++++++++++++++++++
 1 file changed, 68 insertions(+)
 create mode 100644 2018-komp-ling/practicals/unigra_tagger/unigram.md

diff --git a/2018-komp-ling/practicals/unigra_tagger/unigram.md b/2018-komp-ling/practicals/unigra_tagger/unigram.md
new file mode 100644
index 00000000..b370d769
--- /dev/null
+++ b/2018-komp-ling/practicals/unigra_tagger/unigram.md
@@ -0,0 +1,68 @@
# matplotlib

Computing the ranks for our text:

````
import matplotlib.pyplot as plt

freq = []
ranks = []

# load data
with open('./../freq.txt', 'r') as fd:
    lines = fd.readlines()
for line in lines:
    line = line.strip('\n')
    (count, w) = line.split('\t')
    freq.append((int(count), w))

freq.sort(reverse=True)

# rank the data: equal frequencies share a rank
rank = 1
min_f = freq[0][0]
for i in range(0, len(freq)):
    if freq[i][0] < min_f:
        rank += 1
        min_f = freq[i][0]
    ranks.append([rank, freq[i][0], freq[i][1]])

# plot rank against frequency
x = []
y = []
for row in ranks:
    x.append(int(row[0]))
    y.append(int(row[1]))
plt.plot(x, y, 'b*')
plt.show()
````
# ElementTree

### How would you get just the Icelandic line and the gloss line ?
+ +```` +for tier in root.findall('.//tier'): + if tier.attrib['id'] == 'n': + for item in tier.findall('.//item'): + if item.attrib['tag'] != 'T': # here is the condition + print(item.text) +```` + +# scikit learn + +### Perceptron answers +```` +- #хоругвь# incorrect class: 0 correct class: 1 +- #обувь# incorrect class: 0 correct class: 1 +- #морковь# incorrect class: 0 correct class: 1 +- #бровь# incorrect class: 0 correct class: 1 +- #церковь# incorrect class: 0 correct class: 1 +0.982857142857142856 +```` +To improve the quiality of our model we should use MLP, or deeper (than 1 layer) models + +# Screenscraping + +done in __screencap.py__ + From 153a22aa893faa37a65b1b6f04609b0f2555f7ed Mon Sep 17 00:00:00 2001 From: A1exRey Date: Sun, 24 Mar 2019 16:21:16 +0300 Subject: [PATCH 09/16] Add files via upload --- .../practicals/unigra_tagger/screencap.py | 43 +++++++++++++++++++ 1 file changed, 43 insertions(+) create mode 100644 2018-komp-ling/practicals/unigra_tagger/screencap.py diff --git a/2018-komp-ling/practicals/unigra_tagger/screencap.py b/2018-komp-ling/practicals/unigra_tagger/screencap.py new file mode 100644 index 00000000..61ff9c6d --- /dev/null +++ b/2018-komp-ling/practicals/unigra_tagger/screencap.py @@ -0,0 +1,43 @@ +#дерево +import sys + +def strip_html(h): + output = '' + inTag = False + for c in h: + if c == '<': + inTag = True + continue + if c == '>': + inTag = False + continue + if not inTag: + output += c + return output + +stem = '_' +zkod = '_' +ipa = '_' + + +h1 = '_' +for line in sys.stdin.readlines(): + line = line.strip() + text = strip_html(line) + if line.count('
<h1
') > 0:
        h1 = strip_html(line)
    if h1 != 'Русский':
        continue
    if text.count('Корень:') > 0:
        stem = text.split(':')[1].split(';')[0]
    if text.count('МФА') > 0:
        ipa = text.split(';')[3].split('&')[0]
    if text.count('тип склонения') > 0:
        zkod = text.split('тип склонения')[1].strip().split(' ')[0].strip("^")


if stem != '_' and zkod != '_' and ipa != '_':
    print('%s\t%s\t%s' % (stem, zkod, ipa))
    stem = '_'
    zkod = '_'
    ipa = '_'
\ No newline at end of file

From f710b9996ee80520e1332717a4b4fd503423bcdd Mon Sep 17 00:00:00 2001
From: A1exRey
Date: Tue, 26 Mar 2019 15:48:15 +0300
Subject: [PATCH 10/16] HW 4 Pletenev
---
 .../Unigram_part_of_speech_tagger_response.md | 68 +++++++++++++++++++
 1 file changed, 68 insertions(+)
 create mode 100644 2018-komp-ling/practicals/Unigram part-of-speech tagger/Unigram_part_of_speech_tagger_response.md

diff --git a/2018-komp-ling/practicals/Unigram part-of-speech tagger/Unigram_part_of_speech_tagger_response.md b/2018-komp-ling/practicals/Unigram part-of-speech tagger/Unigram_part_of_speech_tagger_response.md
new file mode 100644
index 00000000..b370d769
--- /dev/null
+++ b/2018-komp-ling/practicals/Unigram part-of-speech tagger/Unigram_part_of_speech_tagger_response.md
@@ -0,0 +1,68 @@
# matplotlib

Computing the ranks for our text:

````
import matplotlib.pyplot as plt

freq = []
ranks = []

# load data
with open('./../freq.txt', 'r') as fd:
    lines = fd.readlines()
for line in lines:
    line = line.strip('\n')
    (count, w) = line.split('\t')
    freq.append((int(count), w))

freq.sort(reverse=True)

# rank the data: equal frequencies share a rank
rank = 1
min_f = freq[0][0]
for i in range(0, len(freq)):
    if freq[i][0] < min_f:
        rank += 1
        min_f = freq[i][0]
    ranks.append([rank, freq[i][0], freq[i][1]])

# plot rank against frequency
x = []
y = []
for row in ranks:
    x.append(int(row[0]))
    y.append(int(row[1]))
plt.plot(x, y, 'b*')
plt.show()
````
# ElementTree

### How would you get just the Icelandic line and the gloss line ?
+ +```` +for tier in root.findall('.//tier'): + if tier.attrib['id'] == 'n': + for item in tier.findall('.//item'): + if item.attrib['tag'] != 'T': # here is the condition + print(item.text) +```` + +# scikit learn + +### Perceptron answers +```` +- #хоругвь# incorrect class: 0 correct class: 1 +- #обувь# incorrect class: 0 correct class: 1 +- #морковь# incorrect class: 0 correct class: 1 +- #бровь# incorrect class: 0 correct class: 1 +- #церковь# incorrect class: 0 correct class: 1 +0.982857142857142856 +```` +To improve the quiality of our model we should use MLP, or deeper (than 1 layer) models + +# Screenscraping + +done in __screencap.py__ + From 77ea44d19436979f076fd70f16b283954e9d45e5 Mon Sep 17 00:00:00 2001 From: A1exRey Date: Tue, 26 Mar 2019 15:49:09 +0300 Subject: [PATCH 11/16] HW 4 Pletenev --- .../screencap.py | 43 +++++++++++++++++++ 1 file changed, 43 insertions(+) create mode 100644 2018-komp-ling/practicals/Unigram part-of-speech tagger/screencap.py diff --git a/2018-komp-ling/practicals/Unigram part-of-speech tagger/screencap.py b/2018-komp-ling/practicals/Unigram part-of-speech tagger/screencap.py new file mode 100644 index 00000000..9421d8c5 --- /dev/null +++ b/2018-komp-ling/practicals/Unigram part-of-speech tagger/screencap.py @@ -0,0 +1,43 @@ +#дерево +import sys + +def strip_html(h): + output = '' + inTag = False + for c in h: + if c == '<': + inTag = True + continue + if c == '>': + inTag = False + continue + if not inTag: + output += c + return output + +stem = '_' +zkod = '_' +ipa = '_' + + +h1 = '_' +for line in sys.stdin.readlines(): + line = line.strip() + text = strip_html(line) + if line.count('
<h1
') > 0: + h1 = strip_html(line) + if h1 != 'Русский': + continue + if text.count('Корень:') > 0: + stem = text.split(':')[1].split(';')[0] + if text.count('МФА') > 0: + ipa = text.split(';')[3].split('&')[0] + if text.count('тип склонения') > 0: + zkod = text.split('тип склонения')[1].strip().split(' ')[0].strip("^") + + +if stem != '_' and zkod != '_' and ipa != '_': + print('%s\t%s\t%s' % (stem, zkod, ipa)) + stem = '_' + zkod = '_' + ipa = '_' From 2fc793abd219c50da8735131c4b4d51ed8d3867c Mon Sep 17 00:00:00 2001 From: A1exRey Date: Tue, 26 Mar 2019 15:50:26 +0300 Subject: [PATCH 12/16] Delete unigram.md --- .../practicals/unigra_tagger/unigram.md | 68 ------------------- 1 file changed, 68 deletions(-) delete mode 100644 2018-komp-ling/practicals/unigra_tagger/unigram.md diff --git a/2018-komp-ling/practicals/unigra_tagger/unigram.md b/2018-komp-ling/practicals/unigra_tagger/unigram.md deleted file mode 100644 index b370d769..00000000 --- a/2018-komp-ling/practicals/unigra_tagger/unigram.md +++ /dev/null @@ -1,68 +0,0 @@ -# matplotlib - -Получение рангов для нашего текста - -```` -import matplotlib.pyplot as plt - -freq = [] -ranks = [] - -#load data -with open('./../freq.txt', 'r') as f: - f = f.readlines() -for line in f: - line = line.strip('\n') - (f, w) = line.split('\t') - freq.append((int(f), w)) - -freq.sort(reverse=True) - -#ranking data -rank = 1 -min = freq[0][0] -for i in range(0, len(freq)): - if freq[i][0] < min: - rank +=1 - min = freq[i][0] - ranks.append([rank, freq[i][0], freq[i][1]]) - -#do the plots -x = [] -y = [] -for line in ranks: - row = line - x.append(int(row[0])) - y.append(int(row[1])) -plt.plot(x, y, 'b*') -plt.show() -```` -# ElementTree - -### How would you get just the Icelandic line and the gloss line ? 
- -```` -for tier in root.findall('.//tier'): - if tier.attrib['id'] == 'n': - for item in tier.findall('.//item'): - if item.attrib['tag'] != 'T': # here is the condition - print(item.text) -```` - -# scikit learn - -### Perceptron answers -```` -- #хоругвь# incorrect class: 0 correct class: 1 -- #обувь# incorrect class: 0 correct class: 1 -- #морковь# incorrect class: 0 correct class: 1 -- #бровь# incorrect class: 0 correct class: 1 -- #церковь# incorrect class: 0 correct class: 1 -0.982857142857142856 -```` -To improve the quiality of our model we should use MLP, or deeper (than 1 layer) models - -# Screenscraping - -done in __screencap.py__ - From b8217766d34f7febba442fbe671c55af43967648 Mon Sep 17 00:00:00 2001 From: A1exRey Date: Tue, 26 Mar 2019 15:50:36 +0300 Subject: [PATCH 13/16] Delete screencap.py --- .../practicals/unigra_tagger/screencap.py | 43 ------------------- 1 file changed, 43 deletions(-) delete mode 100644 2018-komp-ling/practicals/unigra_tagger/screencap.py diff --git a/2018-komp-ling/practicals/unigra_tagger/screencap.py b/2018-komp-ling/practicals/unigra_tagger/screencap.py deleted file mode 100644 index 61ff9c6d..00000000 --- a/2018-komp-ling/practicals/unigra_tagger/screencap.py +++ /dev/null @@ -1,43 +0,0 @@ -#дерево -import sys - -def strip_html(h): - output = '' - inTag = False - for c in h: - if c == '<': - inTag = True - continue - if c == '>': - inTag = False - continue - if not inTag: - output += c - return output - -stem = '_' -zkod = '_' -ipa = '_' - - -h1 = '_' -for line in sys.stdin.readlines(): - line = line.strip() - text = strip_html(line) - if line.count('
<h1
') > 0:
        h1 = strip_html(line)
    if h1 != 'Русский':
        continue
    if text.count('Корень:') > 0:
        stem = text.split(':')[1].split(';')[0]
    if text.count('МФА') > 0:
        ipa = text.split(';')[3].split('&')[0]
    if text.count('тип склонения') > 0:
        zkod = text.split('тип склонения')[1].strip().split(' ')[0].strip("^")


if stem != '_' and zkod != '_' and ipa != '_':
    print('%s\t%s\t%s' % (stem, zkod, ipa))
    stem = '_'
    zkod = '_'
    ipa = '_'
\ No newline at end of file

From 4d9f5874f9e3b85450f0a1b6aff8feb504df9559 Mon Sep 17 00:00:00 2001
From: A1exRey
Date: Fri, 29 Mar 2019 21:56:39 +0300
Subject: [PATCH 14/16] HW 5 Pletenev
---
 .../xrenner_practical/xrenner-response.md     | 34 +++++++++++++++++++
 1 file changed, 34 insertions(+)
 create mode 100644 2018-komp-ling/practicals/xrenner_practical/xrenner-response.md

diff --git a/2018-komp-ling/practicals/xrenner_practical/xrenner-response.md b/2018-komp-ling/practicals/xrenner_practical/xrenner-response.md
new file mode 100644
index 00000000..4534f8de
--- /dev/null
+++ b/2018-komp-ling/practicals/xrenner_practical/xrenner-response.md
@@ -0,0 +1,34 @@
# Xrenner Response

First of all, we must do the preparation, such as installing xrenner.
The standard model for English works just fine:

    $ python3 xrenner.py -m eng -o html example_in.conll10 > /mnt/c/sub_wsl/example.html

Next, we must make our own language model (in this case, Russian). We can build the model from scratch or copy the meta-language folder:

    $ cp -R ./models/udx ./models/rus

### Rules

Add some rules to our model.
__pronouns.tab:__

    я 1sg
    мы 1pl
    он male
    она fema
    его male
    её fema
    меня 1sg
    нас 1pl

__coref.tab:__

    Рабиндранат Тагор|Тагор coref

These simple rules give us good results (see pushkin.html):

    $ python3 xrenner.py -m rus -o html pushkin.conllu > /mnt/c/sub_wsl/pushkin.html

From 74eddc6573e58e9a94a7ae5c6b4a533dadc1437c Mon Sep 17 00:00:00 2001
From: A1exRey
Date: Fri, 29 Mar 2019 21:57:33 +0300
Subject: [PATCH 15/16] HW 5 Pletenev
---
 .../practicals/xrenner_practical/example.html | 540 ++++++++++++++++++
 .../practicals/xrenner_practical/pushkin.html | 133 +++++
 2 files changed, 673 insertions(+)
 create mode 100644 2018-komp-ling/practicals/xrenner_practical/example.html
 create mode 100644 2018-komp-ling/practicals/xrenner_practical/pushkin.html

diff --git a/2018-komp-ling/practicals/xrenner_practical/example.html b/2018-komp-ling/practicals/xrenner_practical/example.html
new file mode 100644
index 00000000..b3d25489
--- /dev/null
+++ b/2018-komp-ling/practicals/xrenner_practical/example.html
@@ -0,0 +1,540 @@
+[540 lines of xrenner HTML output: the coreference-annotated token stream of the Wikinews article "New Zealand begins process to consider changing national flag design"; the span markup was stripped in extraction, so the token dump is omitted here]
+early +in +2016 +. + + + \ No newline at end of file diff --git a/2018-komp-ling/practicals/xrenner_practical/pushkin.html b/2018-komp-ling/practicals/xrenner_practical/pushkin.html new file mode 100644 index 00000000..c9fe846a --- /dev/null +++ b/2018-komp-ling/practicals/xrenner_practical/pushkin.html @@ -0,0 +1,133 @@ + + + + + + + + + + +Однажды +
+[133 lines of xrenner HTML output: the coreference-annotated token stream of the Pushkin--Tagore anecdote; the span markup was stripped in extraction, so the token dump is omitted here]
\ No newline at end of file

From 48231e94313daa243c77c9f002688ef02e7b47b6 Mon Sep 17 00:00:00 2001
From: A1exRey
Date: Tue, 2 Apr 2019 14:58:50 +0300
Subject: [PATCH 16/16] Pletenev Sergey homeworks

---
 .../Unigram-part-of-speech-tagger-response.md | 68 +++++++++++++++++++
 .../practicals/segmentation-response.md       | 19 ++++++
 .../practicals/transliteration-response.md    | 42 ++++++++++++
 2018-komp-ling/practicals/xrenner-response.md | 34 ++++++++++
 4 files changed, 163 insertions(+)
 create mode 100644 2018-komp-ling/practicals/Unigram-part-of-speech-tagger-response.md
 create mode 100644 2018-komp-ling/practicals/segmentation-response.md
 create mode 100644 2018-komp-ling/practicals/transliteration-response.md
 create mode 100644 2018-komp-ling/practicals/xrenner-response.md

diff --git a/2018-komp-ling/practicals/Unigram-part-of-speech-tagger-response.md b/2018-komp-ling/practicals/Unigram-part-of-speech-tagger-response.md
new file mode 100644
index 00000000..b370d769
--- /dev/null
+++ b/2018-komp-ling/practicals/Unigram-part-of-speech-tagger-response.md
@@ -0,0 +1,68 @@
+# matplotlib
+
+Getting the frequency ranks for our text:
+
+````
+import matplotlib.pyplot as plt
+
+freq = []
+ranks = []
+
+# load the frequency list: one "frequency<TAB>word" entry per line
+with open('./../freq.txt', 'r') as fobj:
+    lines = fobj.readlines()
+for line in lines:
+    line = line.strip('\n')
+    (f, w) = line.split('\t')
+    freq.append((int(f), w))
+
+freq.sort(reverse=True)
+
+# assign ranks: items that share a frequency share a rank
+rank = 1
+cur_min = freq[0][0]
+for i in range(0, len(freq)):
+    if freq[i][0] < cur_min:
+        rank += 1
+        cur_min = freq[i][0]
+    ranks.append([rank, freq[i][0], freq[i][1]])
+
+# plot rank against frequency
+x = []
+y = []
+for row in ranks:
+    x.append(int(row[0]))
+    y.append(int(row[1]))
+plt.plot(x, y, 'b*')
+plt.show()
+````
+
+# ElementTree
+
+### How would you get just the Icelandic line and the gloss line?
+
+````
+for tier in root.findall('.//tier'):
+    if tier.attrib['id'] == 'n':
+        for item in tier.findall('.//item'):
+            if item.attrib['tag'] != 'T':  # here is the condition
+                print(item.text)
+````
+
+# scikit learn
+
+### Perceptron answers
+````
+- #хоругвь# incorrect class: 0 correct class: 1
+- #обувь# incorrect class: 0 correct class: 1
+- #морковь# incorrect class: 0 correct class: 1
+- #бровь# incorrect class: 0 correct class: 1
+- #церковь# incorrect class: 0 correct class: 1
+0.982857142857142856
+````
+To improve the quality of our model, we could use an MLP or other models deeper than a single layer.
+
+# Screenscraping
+
+done in __screencap.py__

diff --git a/2018-komp-ling/practicals/segmentation-response.md b/2018-komp-ling/practicals/segmentation-response.md
new file mode 100644
index 00000000..fa2dc210
--- /dev/null
+++ b/2018-komp-ling/practicals/segmentation-response.md
@@ -0,0 +1,19 @@
+
+
+ +

An overview of two sentence-tokenization libraries

+This report tries out two sentence-tokenization libraries on the same text: pragmatic segmenter (Ruby) and NLTK (Python). A chunk of the Russian Wikipedia dump was used as the test text.

Pragmatic segmenter (Ruby)

+Pragmatic segmenter is a rule-based library for Ruby. On the Russian Wikipedia sample its quality was below average: most abbreviations, name initials, and the like were incorrectly split into separate sentences.
+Overall, the library is geared towards languages written in the Latin alphabet.

NLTK (Python)

+sent_tokenize() is the NLTK function for detecting sentence boundaries. Under the hood it is an unsupervised machine-learning algorithm (the Punkt tokenizer), which you can also train yourself. NLTK already ships with a set of pre-trained models, including one for Russian. Overall, this library performed better than the Ruby one: most abbreviations and initials are handled correctly; the only remaining problem is abbreviations with spaces inside them.
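The abbreviation problem both libraries wrestle with is easy to reproduce. Below is a minimal, stdlib-only sketch of a naive rule-based splitter (independent of either library); it shows why "г." trips up punctuation-based rules:

```python
import re

def naive_split(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by
    # whitespace and an uppercase (Cyrillic or Latin) letter.
    return re.split(r'(?<=[.!?])\s+(?=[А-ЯA-Z])', text)

# "г." (city) is an abbreviation, not a sentence end, yet the rule
# splits after it because the following "Москве" is capitalized.
for sent in naive_split('Он жил в г. Москве и т. д. Потом уехал.'):
    print(sent)
```

The rule-based pragmatic segmenter and the statistical Punkt model differ precisely in how many such abbreviation cases they can recognize.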
diff --git a/2018-komp-ling/practicals/transliteration-response.md b/2018-komp-ling/practicals/transliteration-response.md new file mode 100644 index 00000000..6532626d --- /dev/null +++ b/2018-komp-ling/practicals/transliteration-response.md @@ -0,0 +1,42 @@ +# Practical 2: Transliteration (engineering) + +
+
+## Questions
+
+What to do with ambiguous letters? For example, Cyrillic `е' could be either je or e.
+
+Can you think of a way that you could provide mappings from many characters to one character?
+For example sh → ш or дж → c?
+
+How might you make different mapping rules for characters at the beginning or end of the string?
+
+### Transliteration rules
+
+The main idea is to start the transliteration with the complex, multi-letter mappings (ч → tch). For example:
+>Шарик -- sh-арик -- sharik
+
+Next, replace all vowels at the beginning and at the end of the word (Я → ya):
+>яблоко -- ya-блоко -- yabloko
+
+After that we can move on to the simple one-letter mappings (у → u):
+>мед -- med
+
+## Methods
+### Encoding and decoding with KOI8-R
+
+Transliteration via the KOI8-R encoding is not the most effective way to transliterate text,
+but it provides some properties that the other approaches lack:
+
+ 1) the original text can be recovered;
+
+ 2) the mapping rules are already fixed by the encoding.
+
+The KOI8-R transliteration method is implemented in transliterate_koi8r.py.
+
+### Encoding and decoding with rules
+
+The transliteration rules live in rules.txt.
+It defines rules for consonants, for vowels, and for vowels at the beginning of a word.
+
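The ordering described above (complex mappings first, then simple single-letter ones) amounts to longest-match-first replacement. Here is a minimal sketch with a toy rule table; the real rules live in rules.txt, so the mappings below (including `дж → j`) are only illustrative:

```python
# Toy rule table: multi-character source rules must win over
# single-character ones, so we try the longest rules first.
RULES = {
    'дж': 'j',
    'щ': 'shch', 'ш': 'sh', 'ч': 'tch', 'ж': 'zh',
    'я': 'ya', 'ю': 'yu', 'ё': 'yo',
    'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e',
    'и': 'i', 'к': 'k', 'л': 'l', 'м': 'm', 'н': 'n', 'о': 'o',
    'п': 'p', 'р': 'r', 'с': 's', 'т': 't', 'у': 'u',
}

def transliterate(word):
    out, i = [], 0
    keys = sorted(RULES, key=len, reverse=True)  # longest rules first
    while i < len(word):
        for k in keys:
            if word.startswith(k, i):
                out.append(RULES[k])
                i += len(k)
                break
        else:  # no rule matched: copy the character unchanged
            out.append(word[i])
            i += 1
    return ''.join(out)

print(transliterate('шарик'))   # sharik
print(transliterate('яблоко'))  # yabloko
```

Without the longest-match step, 'джем' would come out as 'dzhem' (д + ж applied separately) instead of 'jem'.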
diff --git a/2018-komp-ling/practicals/xrenner-response.md b/2018-komp-ling/practicals/xrenner-response.md
new file mode 100644
index 00000000..4534f8de
--- /dev/null
+++ b/2018-komp-ling/practicals/xrenner-response.md
@@ -0,0 +1,34 @@
+# Xrenner Response
+
+First, we must do all the preparation, such as installing xrenner.
+The standard model for English works just fine:
+
+    $ python3 xrenner.py -m eng -o html example_in.conll10 > /mnt/c/sub_wsl/example.html
+
+Next, we must make our own language model (in this case, Russian). We can build this model from scratch, or copy the meta-language folder:
+
+    $ cp -R ./models/udx ./models/rus
+
+### Rules
+
+Now we add some rules to our model.
+
+__pronouns.tab:__
+
+    я	1sg
+    мы	1pl
+    он	male
+    она	fema
+    его	male
+    её	fema
+    меня	1sg
+    нас	1pl
+
+__coref.tab:__
+
+    Рабиндранат Тагор|Тагор	coref
+
+These simple rules give us great results (see pushkin.html):
+
+    $ python3 xrenner.py -m rus -o html pushkin.conllu > /mnt/c/sub_wsl/pushkin.html
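For intuition about what entries like `он male` buy us: an agreement class acts as a hard filter on which antecedents a pronoun may link to. The sketch below is a hypothetical illustration of that idea only, not xrenner's actual matching code; the `AGREE` table simply mirrors the pronouns.tab entries above:

```python
# Hypothetical sketch: agreement classes as a coreference filter.
# The class names mirror the pronouns.tab entries; this is NOT
# xrenner's real resolution algorithm.
AGREE = {'он': 'male', 'его': 'male', 'она': 'fema', 'её': 'fema',
         'я': '1sg', 'меня': '1sg', 'мы': '1pl', 'нас': '1pl'}

def antecedent_candidates(pronoun, mentions):
    """mentions: (text, agreement_class) pairs seen earlier in the text."""
    cls = AGREE.get(pronoun.lower())
    # only mentions whose agreement class matches the pronoun survive
    return [text for text, c in mentions if c == cls]

mentions = [('Пушкин', 'male'), ('Тагор', 'male'), ('жена', 'fema')]
print(antecedent_candidates('его', mentions))  # ['Пушкин', 'Тагор']
print(antecedent_candidates('её', mentions))   # ['жена']
```

A real resolver then ranks the surviving candidates (e.g. by distance or syntax); the agreement table only prunes impossible links, which is why a handful of pronouns.tab lines already improves pushkin.html noticeably.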