The job I'm sabotaging at the moment involves machine learning for probabilistic models. Such models are necessary when a problem cannot be solved deterministically, either because of high computational complexity or because we don't know the actual structure of the problem. One example of the latter case is natural language, which is quite complex and inherently ambiguous. Linguists, psycholinguists and cognitive scientists keep generating and checking numerous hypotheses on the subject, and yet there's no single comprehensive model of language, no algorithm to pass the Turing test. So we stick to small subtasks we can manage, and even there we generally design a simplistic model reflecting our idea of the underlying structure and leave a number of degrees of freedom in the model to be subsequently adjusted to fit the data. This process of adjustment is called training. The idea is simple: we provide a number of sample inputs together with their correct outputs and try to find parameter values which allow the model to best reproduce those outputs. The examples used for training are called the training set.
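To make the idea of training a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the one-feature logistic model, the names `predict` and `train`, the toy training set, and the learning rate and step count. It is not the model from my actual job, just the smallest thing that has free parameters to adjust.

```python
import math

def predict(w, b, x):
    """Probability of class 1 for input x under a one-feature logistic model."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def train(examples, steps=2000, lr=0.1):
    """Adjust the two free parameters (w, b) by gradient ascent on the log-likelihood."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in examples:
            err = y - predict(w, b, x)  # how far the model is from the correct output
            grad_w += err * x
            grad_b += err
        w += lr * grad_w / len(examples)
        b += lr * grad_b / len(examples)
    return w, b

# Hypothetical training set: (input, correct output) pairs.
training_set = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

w, b = train(training_set)
for x, y in training_set:
    print(x, y, round(predict(w, b, x), 3))
```

The two numbers `w` and `b` are the degrees of freedom here; real models have thousands or millions of them, but the loop looks much the same.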
Sounds simple, yeah. But there are a number of catches, of course. Firstly, the performance of the model on the training set is only partially indicative of its expected performance on other, unseen data. Besides, depending on the structure of the model, training can be extremely computationally intensive: it can usually be seen as an optimization problem, not necessarily convex and usually in a high-dimensional space. Finally, the more complex the problem, the bigger the training set we need. And training data may be expensive, since it often takes a skilled linguist or other specialist to produce it.
Funny, really. Most humans are exposed to only a limited number of training examples and still manage to learn well from them. We still don't exactly know how.
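The first catch mentioned above, that training-set performance says little about unseen data, is easy to demonstrate. The sketch below is a deliberately silly "model" that memorizes its training set; the names `train_memorizer`, `tag` and `accuracy`, the word lists and the part-of-speech labels are all made up for the example.

```python
from collections import Counter

def train_memorizer(examples):
    """'Train' by memorizing every example; fall back to the most common label."""
    table = dict(examples)
    fallback = Counter(label for _, label in examples).most_common(1)[0][0]
    return table, fallback

def tag(model, word):
    """Return the memorized label, or the fallback for a word never seen in training."""
    table, fallback = model
    return table.get(word, fallback)

def accuracy(model, examples):
    """Fraction of examples the model labels correctly."""
    return sum(tag(model, word) == label for word, label in examples) / len(examples)

# Hypothetical data: words with part-of-speech labels.
training_set = [("dog", "NOUN"), ("cat", "NOUN"), ("house", "NOUN"),
                ("run", "VERB"), ("eat", "VERB")]
held_out = [("dog", "NOUN"), ("tree", "NOUN"), ("sleep", "VERB"), ("jump", "VERB")]

model = train_memorizer(training_set)
train_acc = accuracy(model, training_set)
held_acc = accuracy(model, held_out)
print(train_acc, held_acc)  # prints: 1.0 0.5
```

A perfect score on the training set, a coin flip on words it has not seen. Which is why some data is normally held back from training and used only for evaluation.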
Friday, 21 January 2011