Natural language processing depends heavily on a sufficient amount
of training data. When working with smaller datasets, we can enrich the
data by analyzing the semantic structure of the language. In this thesis, we
work with valency. Valency carries information about the meaning
of a sentence. While valency is usually a feature of verbs, we can also observe
it in adjectives and nouns. Valency forms valency patterns around carriers.
In theory, each sense of the valency carrier should form a distinguishable
valency pattern. Valency patterns have a small feature space and are well
suited to training machine learning algorithms. They contain enough information to
distinguish the sense of the valency carrier.
Our work is based on the ssj500k 2.1 corpus. Over half of the corpus
contains hand-annotated semantic roles from which we extracted valency
patterns. We built a program for listing and analyzing the valency patterns.
In theory, different verb senses form different valency patterns. We tested
a number of clustering algorithms on the corpus sentences. The goal was to
cluster the valency frames by similar sense and to find sense-specific
valency patterns. We implemented three versions of the Lesk algorithm and two
versions of the k-means algorithm. We used data from SloWNet and SSKJ for the
knowledge-based Lesk algorithms.
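
To illustrate the general idea behind a knowledge-based Lesk approach, the following is a minimal sketch of the simplified Lesk variant, which picks the sense whose dictionary gloss shares the most words with the context. It is not one of the three versions implemented in the thesis. The function name, the toy English glosses, and the stopword list are assumptions made for illustration and are not taken from SloWNet or SSKJ.

```python
def simplified_lesk(context, glosses, stopwords=frozenset()):
    """Return the sense whose gloss overlaps most with the context words.

    context   : list of tokens surrounding the ambiguous word
    glosses   : dict mapping a sense label to its gloss (plain string)
    stopwords : words ignored when counting the overlap
    """
    context_words = {w.lower() for w in context} - stopwords
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        gloss_words = {w.lower() for w in gloss.split()} - stopwords
        overlap = len(context_words & gloss_words)
        # Ties keep the earlier sense, so the dict order acts as a prior.
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical toy sense inventory (illustrative only).
glosses = {
    "bank_finance": "an institution that accepts deposits and lends money",
    "bank_river": "sloping land beside a body of water such as a river",
}
stopwords = frozenset({"a", "an", "the", "of", "and", "on", "that", "as", "such"})
context = "she sat on the bank of the river and watched the water".split()

print(simplified_lesk(context, glosses, stopwords))  # prints "bank_river"
```

In this sketch the context could equally be the words that fill the slots of a valency pattern, so that the gloss overlap is computed over the arguments of the valency carrier rather than over the whole sentence.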