Introduction to Semi-supervised Learning for Computational Linguistics
Creating sufficient labeled data can be very time-consuming. Obtaining the output sequences is not difficult: English texts are available in great quantity. What is time-consuming is pairing those outputs with the inputs they correspond to, that is, producing annotated training data.
Advances in computational linguistics have produced a number of algorithms for semisupervised learning, among which the Yarowsky algorithm has gained particular prominence. These algorithms were developed specifically to address a situation that arises frequently in computational linguistics: there is a linguistically correct answer, a large amount of unlabeled data, and only a very limited amount of labeled data. In contrast to acoustic modeling, classic unsupervised learning is not suitable for these problems, because not just any way of assigning classes is acceptable. Although the learning method is mostly unsupervised, in the sense that most of the data is unlabeled, the labeled data is essential: it provides the only characterization of the linguistically correct classes.
The algorithms just mentioned turn out to be very similar to an older learning method known as self-training, which was unknown in computational linguistics at the time. For this reason, it is more accurate to say that they were rediscovered, rather than invented, by computational linguists. Until very recently, most of the prior work on semisupervised learning was little known even among researchers in machine learning. One goal of the present volume is to make both the prior and the more recent work on semisupervised learning more accessible to computational linguists.
Shortly after the rediscovery of self-training in computational linguistics, a method called co-training was invented by Blum and Mitchell, machine-learning researchers working on text classification. Self-training and co-training have become popular and are widely employed in computational linguistics; together they account for all but a fraction of the work on semisupervised learning in the field. We will discuss them in the next chapter. In the remainder of this chapter, we give a broader perspective on semisupervised learning and lay out the plan for the rest of the book.
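To make the idea concrete before the fuller discussion in the next chapter, the following is a minimal sketch of a generic self-training loop, not the specific algorithms treated later. It assumes a scikit-learn-style classifier and numeric feature arrays; the base classifier, the confidence threshold, and the round limit are illustrative choices, not part of the original presentation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, max_rounds=10):
    """Generic self-training: repeatedly train on the current labeled pool,
    then move confidently classified unlabeled instances into that pool."""
    X_pool, y_pool = X_labeled.copy(), y_labeled.copy()
    remaining = X_unlabeled.copy()
    clf = LogisticRegression(max_iter=1000)  # any probabilistic classifier would do

    for _ in range(max_rounds):
        clf.fit(X_pool, y_pool)
        if len(remaining) == 0:
            break
        probs = clf.predict_proba(remaining)
        confident = probs.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing is labeled confidently enough; stop early
        # Adopt the classifier's own predictions as labels for the confident instances.
        new_labels = clf.classes_[probs[confident].argmax(axis=1)]
        X_pool = np.vstack([X_pool, remaining[confident]])
        y_pool = np.concatenate([y_pool, new_labels])
        remaining = remaining[~confident]

    return clf
```

The confidence threshold governs the central trade-off in such methods: a low threshold adds many, possibly noisy, self-assigned labels, while a high threshold adds only a few reliable ones and may stall early.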
Motivation for Semi-supervised Learning
For most learning tasks of interest, it is easy to obtain samples of unlabeled data. For many language learning tasks, for example, the World Wide Web can be seen as a large collection of unlabeled data. By contrast, in most cases, the only practical way to obtain labeled data is to have subject-matter experts manually annotate the data, an expensive and time-consuming process.
The great advantage of unsupervised learning, such as clustering, is that it requires no labeled training data. The disadvantage has already been mentioned: under the best of circumstances, one might hope that the learner would recover the correct clusters, but hardly that it could correctly label the clusters. In many cases, even the correct clusters are too much to hope for. To say it another way, unsupervised learning methods rarely perform well if evaluated by the same yardstick used for supervised learners. If we expect a clustering algorithm to predict the labels in a labeled test set, without the advantage of labeled training data, we are sure to be disappointed.
The advantage of supervised learning algorithms is that they do well at the harder task: predicting the true labels for test data. The disadvantage is that they only do well if they are given enough labeled training data, but producing sufficient quantities of labeled data can be very expensive in manual effort. The aim of semisupervised learning is to have our cake and eat it, too. Semisupervised learners take as input unlabeled data and a limited source of label information, and, if successful, achieve performance comparable to that of supervised learners at significantly reduced cost in manual production of training data.
We intentionally used the vague phrase “a limited source of label information.” One source of label information is obviously labeled data, but there are alternatives. We will consider at least the following sources of label information:
- labeled data
- a seed classifier (see the sketch following this list)
- limiting the possible labels for instances without determining a unique label
- constraining pairs of instances to have the same, but unknown, label (co-training)
- intrinsic label definitions
- a budget for labeling instances selected by the learner (active learning)
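As an illustration of the difference between the first two sources, the sketch below seeds a learner with a hand-written rule classifier rather than labeled examples. The rule itself, the example texts, and the helper name are hypothetical, chosen only to show the shape of the setup.

```python
def seed_classifier(texts):
    """A hypothetical hand-written seed rule: label a document 1 if it contains
    the word 'goal', 0 if it contains 'ballot', and abstain (None) otherwise."""
    labels = []
    for text in texts:
        if "goal" in text:
            labels.append(1)
        elif "ballot" in text:
            labels.append(0)
        else:
            labels.append(None)  # the seed makes no claim about this instance
    return labels

# With labeled data as the source of label information, we would start from
# hand-annotated (X_labeled, y_labeled) pairs directly. With a seed classifier,
# we instead apply the seed to unlabeled text and keep only the instances it
# is willing to label.
unlabeled_texts = ["the goal came late", "ballot counting continues", "weather today"]
seed_labels = seed_classifier(unlabeled_texts)
seeded = [(t, y) for t, y in zip(unlabeled_texts, seed_labels) if y is not None]
print(seeded)  # [('the goal came late', 1), ('ballot counting continues', 0)]
```

The seed-labeled instances produced this way could then play the role of the initial labeled pool in a loop like the self-training sketch shown earlier.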
The goal of unsupervised learning in computational linguistics is to enable systems to learn natural language without explicit instruction or manual guidance. The ultimate objective, however, is not merely to uncover interesting structure in language, but to acquire the correct target language. That may seem daunting, since learning a particular target language without any labeled data appears implausible.
Nevertheless, semisupervised learning, which combines unsupervised and supervised methods, may offer a way forward. If a small amount of labeled data, or some other limited source of label information, can be obtained, semisupervised learning can potentially extend it to a complete solution. This process resembles bootstrapping in human language acquisition: a small initial kernel of language is acquired through explicit instruction, and the distributional regularities of linguistic forms play a crucial role in extending it to the entirety of the language. Semisupervised learning methods thus offer a possible account of how that initial kernel is extended.