Posted: Oct 07, 2017 2:55 pm
by John Platko
:book:

And this is interesting:


Moreover, to reduce the odds of missing future reward, an optimal agent may trade the risk of immediate pain for information gain and thus forget faster after aversive conditioning. A simple neuronal network reproduces these features. Our theory shows that forgetting in Drosophila appears as an optimal adaptive behavior in a changing environment. This is in line with the view that forgetting is adaptive rather than a consequence of limitations of the memory system.


They seem to understand how the future can cause effects in the present. :nod:


Basic model of decision making in a changing environment

For our model we assumed a simplified scenario where the conditioning pertains directly to the appetitive reaction. In particular, depending on the state of the environment, approaching the odor can lead to reward or punishment, but it can also result in no reinforcement (Fig. 1). Fleeing the odor, i.e. the aversive reaction, never leads to reinforcement. An agent (fruit fly), whose goal is to maximize reinforcement, chooses between the appetitive and aversive reaction depending on past experience. To model the non-deterministic behavior observed in the experiments, we assume that the two available behavioral options involve different costs of responding. These costs of responding, however, fluctuate from trial to trial and cause no bias on average. For instance, a fly which happens to find itself to the right of the group initially could well have a smaller cost of responding for staying on this side of the assessment tube on this trial. More generally, the stochastic costs of responding can be seen as incorporating all other factors that also influence the behavior but do not depend on past experiences involving the conditioned stimulus. The total reward received by the agent is the external reinforcement minus the cost of responding. Our agent takes this into account in decision making, and so the costs of responding result in trial-to-trial fluctuation in the behavior. Whether the appetitive reaction is rewarded depends on the state of the environment. This state changes slowly over time (according to a Markov chain, see Methods and Fig. 1A). So when the appetitive reaction is rewarded on one trial, the same outcome is likely on an immediately subsequent trial, but as time goes by the odds increase that the appetitive reaction results in no reinforcement or even punishment.
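The setup above is easy to sketch as a small simulation. Everything numeric here is my own illustrative assumption, not taken from the paper: the transition probabilities of the slow Markov chain, the reinforcement values, and the zero-mean Gaussian cost of responding are stand-ins just to show the structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical environment states: approaching the odor yields
# reward (+1), nothing (0), or punishment (-1) depending on the state.
STATES = ["rewarding", "neutral", "punishing"]
REINFORCEMENT = {"rewarding": 1.0, "neutral": 0.0, "punishing": -1.0}

# Slow Markov chain: the state mostly persists from trial to trial and
# occasionally switches (these probabilities are assumptions).
T = np.array([[0.95, 0.04, 0.01],
              [0.02, 0.96, 0.02],
              [0.01, 0.04, 0.95]])

def step(state_idx):
    """Advance the environment by one trial according to the Markov chain."""
    return rng.choice(3, p=T[state_idx])

def trial(state_idx, approach):
    """Total reward for one trial: external reinforcement (only if the
    agent approaches) minus a stochastic cost of responding.  The cost is
    zero-mean, so it causes trial-to-trial fluctuation but no bias."""
    cost = rng.normal(0.0, 0.2)
    external = REINFORCEMENT[STATES[state_idx]] if approach else 0.0
    return external - cost
```

Running `step` repeatedly makes the "changing environment" explicit: the rewarding outcome persists for a while, but the odds of a neutral or punishing outcome grow with each transition.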


:coffee:

The agent maintains a belief about the environmental state

If the agent knew the environmental state, the best policy would be simple: choose the appetitive (aversive) reaction if the environmental state is rewarding (punishing). Typically, however, the agent does not know the actual environmental state but, at best, maintains a belief about it (see Fig. 2A and Methods). In our model, the belief consists of the probabilities of receiving rewarding, neutral or punishing reinforcement after selecting the appetitive reaction. Geometrically, the belief can be represented as a position in a 2-dimensional belief space that changes stepwise after the appetitive reaction, which yields new information about the current environmental state, and otherwise drifts towards an equilibrium (forgetting), see Fig. 2B (note that, since the three probabilities sum to one, the probability of the neutral state can be computed from the probabilities of the rewarding and punishing states).
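The two belief movements described here, the stepwise jump after new information and the drift towards equilibrium, look like standard Bayesian filtering on a hidden Markov chain. A minimal sketch, using the same assumed transition matrix as before and an indicator likelihood (each state deterministically yields its outcome, which is my simplifying assumption, not necessarily the paper's):

```python
import numpy as np

# Assumed slow Markov chain over (rewarding, neutral, punishing) states;
# the values are illustrative, not the study's parameters.
T = np.array([[0.95, 0.04, 0.01],
              [0.02, 0.96, 0.02],
              [0.01, 0.04, 0.95]])

def predict(belief):
    """Drift step (forgetting): with no new observation the belief is
    pushed through the transition matrix, relaxing towards the chain's
    equilibrium distribution."""
    return belief @ T

def update(belief, outcome):
    """Information step after the appetitive reaction: condition the
    predicted belief on the observed outcome (0=reward, 1=none,
    2=punishment).  With the deterministic outcome assumption the
    likelihood is an indicator on the matching state."""
    likelihood = np.zeros(3)
    likelihood[outcome] = 1.0
    posterior = predict(belief) * likelihood
    return posterior / posterior.sum()
```

Only two of the three belief components are free, since they sum to one, which is why the belief lives in a 2-dimensional space. Calling `predict` repeatedly without any `update` is exactly the "drift towards equilibrium" reading of forgetting.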


Even fruit flies have beliefs ...