#### Project Description

**Deep Q Networks for exploration of an autonomous agent:**

In this project, an agent starts from scratch in a previously unknown UnityML environment and learns to navigate it by collecting as much reward as possible (yellow bananas) while avoiding negative reward (blue bananas). A reward of +1 is provided for collecting a yellow banana, and a reward of -1 is provided for collecting a blue banana. The agent does not know the rules of the game beforehand and learns solely from interaction and reward feedback. The state space has 37 dimensions, including the agent's velocity and a ray-based perception of objects around the agent's forward direction. Given this information, the agent has to learn how to best select actions. Four discrete actions are available, corresponding to:
- `0`: move forward.
- `1`: move backward.
- `2`: turn left.
- `3`: turn right.
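
To make the action selection step concrete, here is a minimal sketch of epsilon-greedy selection over the four action indices listed above. Epsilon-greedy is a common exploration rule for DQN agents, but its use here is an assumption, as are the names `select_action` and `ACTIONS` and the sample `q_values`; none of them come from the project code.

```python
import random

# Action indices as listed above; the labels are only for readability.
ACTIONS = {0: "forward", 1: "backward", 2: "left", 3: "right"}


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over a list of per-action value estimates."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical value estimates for one state (not real network output).
q_values = [0.1, -0.3, 0.7, 0.2]
greedy = select_action(q_values, epsilon=0.0)
print(greedy, ACTIONS[greedy])  # 2 left
```

With `epsilon=0.0` the choice is purely greedy; during training, epsilon would typically start near 1.0 and decay so the agent explores early and exploits later.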

### Algorithm Implementation Details:

DQN Algorithm with Replay Buffer was used to solve this problem.

### Results:

**Deep Deterministic Policy Gradients for Continuous Control of 2-DOF Manipulator (Single Agent):**

In this project, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.
The observation space consists of 33 variables corresponding to position, rotation, velocity, and angular velocities of the arm. Each action is a vector with four numbers, corresponding to torque applicable to two joints. Every entry in the action vector should be a number between -1 and 1.
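
The action constraint above can be sketched as follows: exploration noise is added to the policy output, and every entry is then clamped into [-1, 1]. This is only an illustration. The Gaussian noise, the `noise_scale` value, and the function names are assumptions, and the project may use a different noise process.

```python
import random

ACTION_SIZE = 4  # four numbers per action, as described above


def clip_action(raw_action, low=-1.0, high=1.0):
    """Clamp every entry of the action vector into [low, high]."""
    return [max(low, min(high, a)) for a in raw_action]


def noisy_action(policy_output, noise_scale=0.2, rng=random):
    """Add Gaussian exploration noise (an assumption), then clip."""
    noisy = [a + rng.gauss(0.0, noise_scale) for a in policy_output]
    return clip_action(noisy)


print(clip_action([1.7, -0.4, -2.3, 0.9]))  # [1.0, -0.4, -1.0, 0.9]
```

Clipping after adding noise guarantees that the environment never receives an out-of-range torque, whatever the noise sample happens to be.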
### Algorithm Implementation Details:

DDPG algorithm was used to solve this problem.

### Results:

**Deep Deterministic Policy Gradients for Multi-Agent Tennis Game (Multi-Agent):**

In this project, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. Thus, the goal of each agent is to keep the ball in play.
The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own, local observation. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.
The task is episodic, and in order to solve the environment, your agents must get an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents). Specifically,
- After each episode, we add up the rewards that each agent received (without discounting), to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores.
- This yields a single **score** for each episode.

The environment is considered solved when the average of those **scores** over 100 consecutive episodes is at least +0.5.
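
The scoring rule for this environment can be sketched in a few lines: sum each agent's undiscounted rewards, take the maximum over the two agents, and average the result over the last 100 episodes. The function names and the sample rewards below are hypothetical and are not taken from the project code.

```python
from collections import deque

SOLVED_THRESHOLD = 0.5  # required average score
WINDOW = 100            # number of consecutive episodes


def episode_score(rewards_per_agent):
    """Sum each agent's undiscounted rewards, then take the max over agents."""
    return max(sum(rewards) for rewards in rewards_per_agent)


def is_solved(recent_scores):
    """True once a full window of scores averages at least the threshold."""
    if len(recent_scores) < WINDOW:
        return False
    return sum(recent_scores) / WINDOW >= SOLVED_THRESHOLD


recent = deque(maxlen=WINDOW)  # keeps only the latest 100 episode scores
# Hypothetical per-step rewards for one short episode, one list per agent.
recent.append(episode_score([[0.1, 0.1, -0.01], [0.1, 0.0, 0.0]]))
print(is_solved(recent))  # False
```

Using a `deque` with `maxlen` drops the oldest score automatically, so the average always covers the most recent 100 episodes.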