Sunday, 31 August 2014

Semi-supervised learning: Major varieties of learning problem
Machine learning focuses on five main types of learning problems, with the first four falling under the category of function estimation. These problems can be grouped based on two dimensions: whether the learning task is supervised or unsupervised and whether the variable to be predicted is nominal or real-valued.

The first type of problem is classification, which involves supervised learning of a function f(x) that predicts a nominal value. The function learned is called a classifier, and it determines the class to which an instance x belongs based on its input. For example, the task might involve classifying a word in a sentence based on its part of speech. The learner is given labeled data, which includes instances along with their correct class labels. Using this data, the classifier learns to make predictions for new instances.
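
To make the setting concrete, here is a toy sketch of supervised classification in Python. The 1-nearest-neighbor rule and the invented word features are illustrative choices of ours, not anything prescribed above; real part-of-speech classifiers use far richer features.

```python
def nearest_neighbor_classifier(train):
    """Learn a classifier f(x) from labeled data.

    `train` is a list of (features, label) pairs. Prediction simply
    copies the label of the closest training instance (1-NN).
    """
    def distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def classify(x):
        _, label = min(train, key=lambda pair: distance(pair[0], x))
        return label

    return classify

# Toy labeled data: two hypothetical features per word
# (say, a suffix score and a position score).
labeled_data = [
    ((0.9, 0.2), "NOUN"),
    ((0.8, 0.3), "NOUN"),
    ((0.1, 0.7), "VERB"),
    ((0.2, 0.9), "VERB"),
]
f = nearest_neighbor_classifier(labeled_data)
print(f((0.85, 0.25)))  # -> NOUN
```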

Clustering is the unsupervised counterpart of classification. In clustering, the goal is also to assign instances to classes, but the algorithm sees only the instances themselves, never the correct answer for any of them. The essential difference between classification and clustering is thus the input given to the learner: labeled data in one case, unlabeled data in the other.

Two further function estimation tasks round out the picture. In regression, the learner estimates a function that takes real values rather than values from a finite nominal set. Its unsupervised counterpart is density estimation: given an unlabeled set of training data, the learner must produce a function that assigns a real value (a probability density) to every point in the space.

Finally, in reinforcement learning the learner receives a stream of data from sensors and is expected to take actions in response, guided by a reward signal that it tries to maximize over time. What sets reinforcement learning apart from the four function estimation settings is the sequential nature of the inputs and the indirect supervision provided by the reward signal.
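
For contrast with the classifier above, the toy k-means sketch below shows what an unsupervised learner works with: it groups instances by similarity, but the cluster indices it assigns are arbitrary and carry no class names. Data and parameters are invented for illustration.

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Group unlabeled points into k clusters.

    The algorithm sees only the instances themselves; it returns
    groupings, not class labels.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((pj - cj) ** 2
                                      for pj, cj in zip(p, centers[c])))
            clusters[i].append(p)
        # Move each center to the mean of its assigned points.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return clusters

print(k_means([(0.1, 0.1), (0.2, 0.0), (0.9, 1.0), (1.0, 0.8)], k=2))
```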
 
Semisupervised learning is a form of machine learning that combines elements of both supervised and unsupervised learning. The distinction between these two approaches lies in whether or not the training data is labeled, with supervised learning relying on labeled data to classify and predict outcomes, while unsupervised learning seeks to discover patterns and structure within unlabeled data. In contrast, semisupervised learning involves providing some labeled data to the learner, while leaving the rest unlabeled. This mixed setting is the canonical case for semisupervised learning, and many methods have been developed to take advantage of it.

However, labeled and unlabeled data are not the only ways of providing partial information to the learner about the labels for training data. For instance, a few reliable rules for labeling instances or constraints limiting the candidate labels for specific instances could also be used. These alternative methods of partial labeling are also relevant to semisupervised learning and are often used in practice. While reinforcement learning could also be seen as a form of semisupervised learning because it relies on indirect information about labels, the connection between reinforcement learning and other semisupervised approaches is not well understood and is beyond the scope of this discussion.

Introduction to Semi-supervised Learning for Computational Linguistics

Creating sufficient labeled data can be very time-consuming. Obtaining the output sequences is not difficult: English texts are available in great quantity. What is time-consuming is annotating those texts with the correct labels, which must largely be done by hand.

Advances in computational linguistics have produced a number of algorithms for semisupervised learning, among which the Yarowsky algorithm has gained particular prominence. These algorithms were developed to tackle problems typical of computational linguistics: there is a correct linguistic answer, a large amount of unlabeled data, and very limited labeled data. In contrast to a setting like acoustic modeling, classic unsupervised learning is not suitable for these problems, because not just any way of assigning classes is acceptable. Although the learning method is mostly unsupervised, in the sense that most of the data is unlabeled, the labeled data is essential: it provides the only characterization of the linguistically correct classes.

The algorithms just mentioned turn out to be very similar to an older learning method known as self-training, which was unknown in computational linguistics at the time. For this reason, it is more accurate to say that self-training was rediscovered, rather than invented, by computational linguists. Until very recently, most prior work on semisupervised learning has been little known even among researchers in machine learning. One goal of the present volume is to make the prior and more recent work on semisupervised learning more accessible to computational linguists.

Shortly after the rediscovery of self-training in computational linguistics, a method called co-training was invented by Blum and Mitchell, machine-learning researchers working on text classification. Self-training and co-training have become popular and widely employed in computational linguistics; together they account for all but a fraction of the work on semisupervised learning in the field. We will discuss them in the next chapter. In the remainder of this chapter, we give a broader perspective on semisupervised learning and lay out the plan of the rest of the book.
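
Because self-training recurs throughout the discussion, a schematic sketch may help fix ideas. This is one simple variant, not Yarowsky's or Blum and Mitchell's exact procedure, and the classifier interface (a factory returning an object whose predict method yields a label and a confidence) is hypothetical.

```python
def self_train(classifier_factory, labeled, unlabeled,
               threshold=0.9, rounds=5):
    """Grow a small labeled seed using the learner's own predictions.

    Repeatedly: train a base classifier on the current labeled set,
    label the unlabeled pool, promote the most confident predictions
    to labeled status, and retrain.
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = classifier_factory(labeled)
        confident, rest = [], []
        for x in pool:
            label, confidence = model.predict(x)
            (confident if confidence >= threshold else rest).append((x, label))
        if not confident:
            break  # nothing confident enough; stop growing the seed
        labeled.extend(confident)
        pool = [x for x, _ in rest]
    return classifier_factory(labeled)
```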

Motivation for Semi-supervised Learning


For most learning tasks of interest, it is easy to obtain samples of unlabeled data. For many language learning tasks, for example, the World Wide Web can be seen as a large collection of unlabeled data. By contrast, in most cases, the only practical way to obtain labeled data is to have subject-matter experts manually annotate the data, an expensive and time-consuming process.

The great advantage of unsupervised learning, such as clustering, is that it requires no labeled training data. The disadvantage has already been mentioned: under the best of circumstances, one might hope that the learner recovers the correct clusters, but hardly that it could correctly label them. In many cases, even the correct clusters are too much to hope for. To put it another way, unsupervised learning methods rarely perform well when evaluated by the same yardstick used for supervised learners: if we expect a clustering algorithm to predict the labels in a labeled test set without the advantage of labeled training data, we are sure to be disappointed.

The advantage of supervised learning algorithms is that they do well at the harder task: predicting the true labels for test data. The disadvantage is that they only do well if they are given enough labeled training data, but producing sufficient quantities of labeled data can be very expensive in manual effort. The aim of semisupervised learning is to have our cake and eat it, too. Semisupervised learners take as input unlabeled data and a limited source of label information, and, if successful, achieve performance comparable to that of supervised learners at significantly reduced cost in manual production of training data.

We intentionally used the vague phrase “a limited source of label information.” One source of label information is obviously labeled data, but there are alternatives. We will consider at least the following sources of label information:
  • labeled data
  • a seed classifier
  • limiting the possible labels for instances without determining a unique label
  • constraining pairs of instances to have the same, but unknown, label (co-training)
  • intrinsic label definitions
  • a budget for labeling instances selected by the learner (active learning)

The goal of unsupervised learning in computational linguistics is to enable autonomous systems to learn natural language without the need for explicit instruction or manual guidance. However, the ultimate objective is not merely to uncover interesting language structure but to acquire the correct target language. This may seem daunting since learning a target language without labeled data appears implausible. 
 
Nevertheless, semisupervised learning, which combines unsupervised and supervised methods, may offer a way in. Starting from a small amount of labeled data, a semisupervised learner can potentially use unlabeled data to extend that seed into a complete solution. The process resembles bootstrapping accounts of human language acquisition: a small initial kernel of language is acquired first, and distributional regularities of linguistic forms play a crucial role in extending it to the entirety of the language. Semisupervised learning methods thus suggest a possible account of how the initial kernel of language is extended in human language acquisition.

Supervised and Unsupervised Training with Hidden Markov Models
Church and DeRose used Hidden Markov Models (HMMs), originally developed for speech recognition, in their computational linguistics work. HMMs are probabilistic models that generate sequences of states along with parallel sequences of output symbols. In language modeling, the output symbols are words, so an output sequence represents a sentence of natural language. The automaton defined by an HMM can be in any of several distinct states, and it begins by randomly selecting a state. It then emits a symbol, chooses a new state, and repeats the process. Each choice is stochastic, made at random according to a distribution over output symbols or over next states, depending on the kind of choice and the current state.
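
The generative process just described is easy to write down directly. In this sketch the probability tables and vocabulary are invented toy values, not parameters of any real tagger.

```python
import random

# A toy HMM: two states (tags) and a four-word vocabulary.
START = {"NOUN": 0.6, "VERB": 0.4}
TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
         "VERB": {"NOUN": 0.8, "VERB": 0.2}}
EMIT = {"NOUN": {"dog": 0.5, "cat": 0.5},
        "VERB": {"runs": 0.6, "sleeps": 0.4}}

def draw(dist, rng):
    """Draw one outcome from a {outcome: probability} table."""
    r, total = rng.random(), 0.0
    for outcome, p in dist.items():
        total += p
        if r < total:
            return outcome
    return outcome  # guard against floating-point rounding

def generate(length, seed=0):
    """Run the automaton: pick a state, emit a word, move on."""
    rng = random.Random(seed)
    state = draw(START, rng)
    output = []
    for _ in range(length):
        output.append((draw(EMIT[state], rng), state))
        state = draw(TRANS[state], rng)
    return output

print(generate(4))  # e.g. [('dog', 'NOUN'), ('runs', 'VERB'), ...]
```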

The model's parameters are numeric values giving the probability of each possible transition and each possible emission. Learning an HMM is simple if labeled data is provided, that is, data pairing state sequences with output sequences. To estimate the probability of a particular outcome for a stochastic choice of a given type, one counts how often that outcome was chosen in the labeled data, relative to the number of opportunities to choose it.
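
That counting procedure can be sketched in a few lines. Here the labeled data format (sentences of (word, state) pairs) and the function names are our own, and smoothing is omitted to keep the sketch short.

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Supervised HMM estimation by relative-frequency counting.

    `tagged_sentences` is labeled data: each sentence is a list of
    (word, state) pairs, pairing the output sequence with the state
    sequence that produced it.
    """
    transitions, emissions = Counter(), Counter()
    state_counts, start_counts = Counter(), Counter()
    for sentence in tagged_sentences:
        start_counts[sentence[0][1]] += 1
        for i, (word, state) in enumerate(sentence):
            state_counts[state] += 1
            emissions[(state, word)] += 1
            if i + 1 < len(sentence):
                transitions[(state, sentence[i + 1][1])] += 1
    n = sum(start_counts.values())

    def p_start(s):    # P(first state = s)
        return start_counts[s] / n

    def p_trans(s, t):  # P(next state = t | current state = s)
        total = sum(c for (a, _), c in transitions.items() if a == s)
        return transitions[(s, t)] / total if total else 0.0

    def p_emit(s, w):   # P(word = w | state = s)
        return emissions[(s, w)] / state_counts[s] if state_counts[s] else 0.0

    return p_start, p_trans, p_emit
```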

Church and DeRose used HMMs to tackle part-of-speech tagging by associating the states of the automaton with parts of speech. The automaton generates a sequence of parts of speech and emits a word for each part of speech, resulting in a tagged text where each word is annotated with its corresponding part of speech. Supervised learning of an HMM for part-of-speech tagging is effective, with HMM taggers for English having an error rate of around 3.5 to 4 percent. The success of these models in part-of-speech tagging was what initially drew attention to probabilistic models in computational linguistics.
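
At prediction time, the tagger must recover the most probable state sequence for a given word sequence; the standard tool is Viterbi decoding. The sketch below reuses the probability functions from the counting sketch above. It is a textbook rendering, without the log-space arithmetic or smoothing a practical tagger needs, and not Church's or DeRose's exact system.

```python
def viterbi(words, states, p_start, p_trans, p_emit):
    """Return the most probable state (tag) sequence for `words`."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (p_start(s) * p_emit(s, words[0]), [s]) for s in states}
    for word in words[1:]:
        best = {
            t: max(((prob * p_trans(s, t) * p_emit(t, word), path + [t])
                    for s, (prob, path) in best.items()),
                   key=lambda pair: pair[0])
            for t in states
        }
    return max(best.values(), key=lambda pair: pair[0])[1]

# Hypothetical usage with the estimator sketched earlier:
# p_start, p_trans, p_emit = estimate_hmm(tagged_corpus)
# viterbi(["the", "dog", "runs"], ["DET", "NOUN", "VERB"],
#         p_start, p_trans, p_emit)
```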

Probabilistic Methods in Computational Linguistics
Computational linguistics is a field that aims to develop techniques for processing human languages through automatic means. Since the introduction of electronic computers in the late 1940s, researchers have been interested in machine translation, which was one of the earliest topics to attract attention. The development of computers was inspired by the idea of creating a thinking machine, or machina sapiens, and language was considered a uniquely human cognitive ability. Early work on artificial intelligence pitted symbolic reasoning against stochastic systems like neural nets. However, it soon became apparent that a solid probabilistic foundation was necessary to deal with uncertainty.

In computational linguistics, the belief in the adequacy of grammatical and logical constraints, supplemented by ad hoc heuristics, persisted for a long time. However, when the field acknowledged the importance of probabilistic methods, the shift was rapid and significant. The emergence of statistical part-of-speech tagging in 1988, which was described in papers by Church and DeRose, can be seen as the beginning of this awareness.

Prior to the papers by Church and DeRose, stochastic methods for part-of-speech disambiguation had been proposed, but they had not gained much prominence in computational linguistics. The Church and DeRose papers, by contrast, had a profound impact and reshaped the field within a decade. At the time, one of the major challenges in natural language processing was the fragility of manually constructed systems; ambiguity resolution, portability, and robustness were the primary issues. Semantic constraints were often too loose, admitting numerous viable analyses, or too strict, ruling out the correct analysis. Hence there was a need for automatic methods to soften constraints and resolve ambiguities. Portability required adapting systems to variability across application domains, and robustness demanded handling of errorful input and incomplete grammars. All of these challenges called for automatic learning methods, which explains why probabilistic methods, and machine learning in particular, penetrated the field so rapidly. Today, computational linguistics is inseparable from machine learning.

Wednesday, 6 August 2014

Harmoni Cinta: an album by Gita Gutawa

Harmoni Cinta is an album released by Gita Gutawa in 2009 under Sony Music Indonesia; a portion of its sales went to help underprivileged students receive an education. It was created in collaboration with several Indonesian musicians, including her father Erwin Gutawa and renowned artists such as Melly Goeslaw and Glenn Fredly. Production took around nine months, from June 2008 to March 2009. Dick Lee, a songwriter from Singapore, also contributed, composing the song "Remember." Harmoni Cinta includes a cover of the title song from Chrisye's album Aku Cinta Dia.

The recording of Harmoni Cinta involved several locations around the world. The vocals were recorded in Jakarta, while the orchestral pieces were recorded separately by the City of Prague Philharmonic Orchestra and the Sofia Symphonic Orchestra in their respective cities. The album was mixed in two different studios, with six songs mixed at 301 Studio in Sydney and the remaining six mixed at Aluna Studio in Jakarta. "Aku Cinta Dia" was later mastered in New York at Sterling Sound Mastering. This collaborative effort resulted in work on the album taking place across four different continents.

Gita Gutawa played a bigger role in creating Harmoni Cinta than she had on her self-titled debut. She helped determine the album's overall concept, selected the songs to include, and wrote five of them herself. The album combines light, enjoyable teen pop with orchestrated classic pop. As on her previous album, she explores themes such as young love, friendship, family ties, and worldliness.

One of the album's tracks, "Parasit," is a story of puppy love between two pre-teens, with references to biology, physics, and geography, including the Sahara Desert, Antarctica, and outer space. Another track, "Harmoni Cinta," features extravagant orchestral backing, while "Mau Tapi Malu" has a more coquettish tone with Gita singing alongside Mey Chan and Maia Estianty. "Remember" is a bilingual track with both English and Indonesian lyrics, and traditional instruments are used in its arrangement. The remaining tracks, "Selamat Datang Cinta," "Meraih Mimpi," "Lullaby," and "When You Wish Upon a Star," are slower and more minimalistic in style.

1996 Thomas Cup

The 1996 Thomas & Uber Cup was a major international badminton tournament held in Hong Kong. It was the 19th edition of the Thomas Cup and the 16th edition of the Uber Cup, and it brought together the world's best national teams and players.

The press conference for the 1996 Thomas Cup was held at Bank Rakyat Indonesia's building in the Sentra BRI complex on Sudirman, Central Jakarta. It was led by Putera Sampoerna, chairman of PT HM Sampoerna Tbk, the manufacturer of A Mild, the fifth-largest cigarette brand in Indonesia. A Mild was also the main sponsor of the 1996 TUC, and Putera Sampoerna accordingly played a significant role in organizing the tournament.

The opening and closing ceremonies of the 1996 TUC were likewise led by Putera Sampoerna, as A Mild was the tournament's main sponsor. The ceremonies were grand affairs, with dazzling fireworks displays and cultural performances.

The Indonesian Thomas & Uber Cup squads entered as the reigning champions in both competitions, having swept both titles in 1994. The squads were united in their quest for victory and determined to defend their titles against tough competition from the world's best badminton players.

The 1996 Thomas & Uber Cup was a highly anticipated event that showcased some of the best players in the sport, and the tournament was a great success, thanks in large part to Putera Sampoerna and the other organizers.

Indonesia's Thomas Cup team consisted of some of the best players in the world at the time, including Joko Suprianto, Hermawan Susanto, and the doubles pair of Ricky Subagja and Rexy Mainaky.

In the final of the Thomas Cup, held in Hong Kong, Indonesia defeated Denmark to retain the title they had won in 1994.

The Indonesian Uber Cup team was also strong, featuring players such as Susi Susanti, Mia Audina, and Finarsih. In the final, also held in Hong Kong, Indonesia defeated China to retain the Uber Cup.

The victories of both the Thomas Cup and Uber Cup teams were celebrated widely in Indonesia, and the players were hailed as national heroes. The 1996 TUC was a major milestone for Indonesian badminton and helped cement the country's reputation as a dominant force in the sport.