Exploration Algorithm
I have massively edited the question to make it easier to understand.
Given an environment of arbitrary dimensions, with an arbitrary number of arbitrarily positioned obstacles, I have an agent exploring the environment with a limited range of sight (obstacles don't block sight). It can move in the 4 cardinal directions (NSEW), 1 cell at a time, and the graph is unweighted (each step has a cost of 1). Linked below is a map representing the agent's (the yellow guy's) current belief of the environment at the instant of planning. Time does not pass in the simulation while the agent is planning.
http://imagizer.imageshack.us/a/img913/9274/qrsazt.jpg
What exploration algorithm can I use to maximise the cost-efficiency of utility, given that revisiting cells is allowed? Each cell holds a utility value. Ideally, I seek to maximise the sum of the utilities of the cells seen (not visited) divided by the path length, although if that is too complex for a suitable algorithm, then the number of cells seen will suffice. There is a maximum path length, but it is generally in the hundreds or higher. (The actual test environments used on the agent are at least 4x bigger, although theoretically there is no upper bound on the dimensions that can be set, and the maximum path length would increase accordingly.)
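To make the objective concrete, here is a minimal Python sketch of the score I am trying to maximise (`utility` and `visible_from` are placeholders for my own simulation code, not existing functions):

```python
def path_score(path, utility, visible_from):
    """Total utility of distinct cells seen along the path, divided by path length.

    path         -- ordered list of (row, col) cells the agent visits
    utility      -- dict mapping cell -> utility value
    visible_from -- function cell -> set of cells within sight range of that cell
    """
    seen = set()
    for cell in path:
        seen.update(visible_from(cell))
    total_utility = sum(utility[c] for c in seen)
    return total_utility / max(len(path), 1)  # guard against an empty path
```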
I consider BFS and DFS intractable, A* non-optimal given the lack of suitable heuristics, and Dijkstra's inappropriate for generating a single unbroken path. Is there an algorithm you can think of? Also, I need help with loop detection, as I've never done that before, since this is the first time I'm allowing revisitations.
One approach I have considered is to reduce the map to a spanning tree, except that instead of defining it as a tree that connects all cells, it is defined as a tree that can see all cells. My approach would result in the following:
http://imagizer.imageshack.us/a/img910/3050/hgu40d.jpg
In the resultant tree, the agent can go from a node to any adjacent node that is 0-1 turns away at intersections. That is as far as my thinking has gotten right now. The solution generated using this tree may not be optimal, but it should at least be near-optimal with far fewer cells being processed by the algorithm, so if that makes the algorithm more tractable, then I guess it is an acceptable trade-off. I'm still stuck thinking about how to generate a path with it, however.
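For illustration, here is a rough sketch of how the nodes of such a "seeing" tree could be picked greedily (in a set-cover style). `free_cells` and `visible_from` stand in for my own map code, and connecting the chosen nodes into an actual tree (e.g., via BFS paths between them) is left out:

```python
def choose_seeing_nodes(free_cells, visible_from):
    """Greedily pick a subset of free cells whose combined sight covers all free cells.

    free_cells   -- set of traversable (row, col) cells
    visible_from -- function cell -> set of cells visible from that cell
    Returns the chosen nodes (the tree's vertices, before connecting them).
    """
    uncovered = set(free_cells)
    nodes = []
    while uncovered:
        # Pick the cell whose field of view covers the most still-uncovered cells.
        best = max(free_cells, key=lambda c: len(visible_from(c) & uncovered))
        gain = visible_from(best) & uncovered
        if not gain:  # nothing more can be seen; stop rather than loop forever
            break
        nodes.append(best)
        uncovered -= gain
    return nodes
```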
Your problem is very similar to the canonical reinforcement learning (RL) problem, the grid world. I would formalize it as a standard Markov decision process (MDP) and use an RL algorithm to solve it.
The formalization would be:
- States S: the NxM discrete grid.
- Actions A: {up, down, left, right}.
- Reward R: the value of the cells the agent can see from the destination cell s', i.e. R(s, a, s') = sum(value(seen(s'))).
- Transition function: P(s' | s, a) = 1 if s' is not out of bounds and not a black cell, 0 otherwise.
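A minimal Python sketch of this formalization follows (the grid representation, the `None`-for-obstacle convention, and the square `sight_range` window are my own illustrative assumptions, not part of your setup):

```python
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

class GridWorldMDP:
    def __init__(self, grid, sight_range):
        self.grid = grid                          # 2D list: utility value, or None for an obstacle
        self.n, self.m = len(grid), len(grid[0])
        self.sight_range = sight_range

    def is_free(self, cell):
        r, c = cell
        return 0 <= r < self.n and 0 <= c < self.m and self.grid[r][c] is not None

    def seen(self, cell):
        """Free cells within sight range of `cell` (obstacles do not block sight)."""
        r, c = cell
        return {(r + dr, c + dc)
                for dr in range(-self.sight_range, self.sight_range + 1)
                for dc in range(-self.sight_range, self.sight_range + 1)
                if self.is_free((r + dr, c + dc))}

    def step(self, state, action):
        """Deterministic transition: P(s'|s,a) = 1 unless s' is out of bounds or an obstacle."""
        dr, dc = ACTIONS[action]
        nxt = (state[0] + dr, state[1] + dc)
        if not self.is_free(nxt):
            nxt = state                           # blocked moves leave the agent in place
        reward = sum(self.grid[r][c] for r, c in self.seen(nxt))
        return nxt, reward
```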
Since you are interested in the average reward, the discount factor is 1 and you have to normalize the cumulative reward by the number of steps. You also said that each step has a cost of one, so you could subtract 1 from the immediate reward R at each time step, but this would not add anything, since you already average over the number of steps.
Since the problem is discrete, the policy can be a simple softmax (or Gibbs) distribution.
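For example, a softmax (Gibbs) policy over tabular action values might look like this (a sketch; `q_values` is assumed to be the array of action values for the current state):

```python
import numpy as np

def softmax_policy(q_values, temperature=1.0):
    """Action probabilities from one state's Q-values (Gibbs/softmax).
    High temperature -> near-uniform (more exploration); low -> near-greedy."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                 # subtract the max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def sample_action(q_values, temperature=1.0):
    probs = softmax_policy(q_values, temperature)
    return np.random.choice(len(probs), p=probs)
```

Annealing the temperature towards zero over the course of learning is one simple way to move from exploration to exploitation.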
As a solving algorithm you can use Q-learning, which guarantees the optimality of the solution provided a sufficient number of samples. However, if the grid is big (and you said there is no upper bound), I would suggest policy search algorithms, like policy gradient or relative entropy (although they only guarantee convergence to a local optimum). You can find Q-learning practically everywhere on the Internet. For a recent survey on policy search I suggest this.
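A bare-bones tabular Q-learning loop for the MDP sketched above could look like this (it reuses the hypothetical `GridWorldMDP` and `sample_action` helpers from the previous snippets; for simplicity it uses a discount factor below 1 rather than the average-reward criterion discussed above, and the hyperparameters are arbitrary):

```python
from collections import defaultdict
import numpy as np

def q_learning(env, start, episodes=500, horizon=200,
               alpha=0.1, gamma=0.95, temperature=1.0, n_actions=4):
    """Tabular Q-learning with softmax exploration.
    Returns the learned Q-table: state -> array of action values."""
    q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(episodes):
        state = start
        for _ in range(horizon):
            action = sample_action(q[state], temperature)
            nxt, reward = env.step(state, action)
            td_target = reward + gamma * q[nxt].max()   # bootstrap from the best next action
            q[state][action] += alpha * (td_target - q[state][action])
            state = nxt
    return q

# Example usage (with made-up grid data):
# q = q_learning(GridWorldMDP(grid, sight_range=3), start=(0, 0))
```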
The cool thing about these approaches is that they encode the exploration in the policy (e.g., the temperature in a softmax policy, the variance in a Gaussian distribution) and try to maximize the cumulative long-term reward as described by your MDP. So you usually initialize your policy with high exploration (e.g., a completely random policy) and, by trial and error, the algorithm will make it more deterministic and converge to the optimal one (however, sometimes a stochastic policy is optimal). The main difference between all the RL algorithms is how they perform the update of the policy at each iteration and manage the exploration-exploitation tradeoff (how much should I explore vs. how much should I exploit the information I already have).
As suggested by demplo, you could also use genetic algorithms (GA), but they are usually slower and require more tuning (elitism, crossover, mutation, ...).
I have tried some policy search algorithms on your problem and they seem to work well, although I initialized the grid randomly and do not know the exact optimal solution. If you provide some additional details (a test grid, the max number of steps, and whether the initial position is fixed or random), I can test them more precisely.