You can also find my articles on my Google Scholar Profile.
Learning to Manipulate by Learning to See
Jianren Wang*, Sudeep Dasari*, Mohan Kumar, Shubham Tulsiani, Abhinav Gupta (* indicates equal contribution)
In Submission
[Project Page] [Code] [Abstract] [Bibtex]

While the field of visual representation learning has seen explosive growth in the past years, its spillover effects in robotics have been surprisingly limited so far. Prior work in this space used pre-trained representations to "warm start" policy and value function learning. This approach is inherently limited, since predicting action sequences and learning value functions both require finicky algorithms and hard-to-tune networks. Instead, we propose to abandon this paradigm and directly leverage representations to control the robotic manipulator. This is made possible by the structure of the learned visual representation space: it naturally encodes relationships between states as distances. These distances can be used to plan robot behavior by greedily selecting actions that best reach a goal state. This paper develops a simple algorithm for acquiring a distance function and dynamics predictor from pre-trained visual representations, using human-collected video sequences. In addition to outperforming baselines that adopt the traditional action/value learning perspective (e.g., our method achieves 90% success vs. 65% for behavior cloning on a pushing task), this approach can acquire manipulation controllers without any robot demonstrations or rollouts.
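
The greedy planning step described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`greedy_action`, `toy_dynamics`, `toy_distance`), the additive dynamics, and the Euclidean distance are all hypothetical stand-ins for the learned dynamics predictor and distance function.

```python
import numpy as np

def greedy_action(z, z_goal, candidate_actions, dynamics, distance):
    """Return the candidate action whose predicted next embedding is closest to the goal."""
    costs = [distance(dynamics(z, a), z_goal) for a in candidate_actions]
    return candidate_actions[int(np.argmin(costs))]

# Toy stand-ins (assumptions, not learned models): additive dynamics, Euclidean distance.
def toy_dynamics(z, a):
    return z + a

def toy_distance(z1, z2):
    return float(np.linalg.norm(z1 - z2))

z = np.zeros(2)                  # current state embedding
z_goal = np.array([1.0, 0.0])    # goal state embedding
actions = [
    np.array([0.1, 0.0]),
    np.array([-0.1, 0.0]),
    np.array([0.0, 0.1]),
    np.array([0.0, -0.1]),
]
best = greedy_action(z, z_goal, actions, toy_dynamics, toy_distance)
```

In this toy setup the selected action is the step toward the goal along the first axis; in practice both callables would be networks trained on top of a pre-trained visual encoder.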

    title={Learning to Manipulate by Learning to See},
    author={Wang, Jianren and Dasari, Sudeep and Kumar, Mohan and Tulsiani, Shubham and Gupta, Abhinav},
RB2: Robotic Manipulation Benchmarking with a Twist
Sudeep Dasari, Jianren Wang, Joyce Hong, Shikhar Bahl, Yixin Lin, Austin Wang, Abitha Thankaraj, Karanbir Chahal, Berk Calli, Saurabh Gupta, David Held, Lerrel Pinto, Deepak Pathak, Vikash Kumar, Abhinav Gupta
2021 Conference on Neural Information Processing Systems
[Project Page] [Code] [Abstract] [Bibtex]

Benchmarks offer a scientific way to compare algorithms using objective performance metrics. Good benchmarks have two features: (a) they should be widely useful for many research groups, and (b) they should produce reproducible findings. In robotic manipulation research, there is a trade-off between reproducibility and broad accessibility. If the benchmark is kept restrictive (fixed hardware, objects), the numbers are reproducible but the setup becomes less general. On the other hand, a benchmark could be a loose set of protocols (e.g., the YCB object set), but the underlying variation in setups makes the results non-reproducible. In this paper, we re-imagine benchmarking for robotic manipulation as state-of-the-art algorithmic implementations, alongside the usual set of tasks and experimental protocols. The added baseline implementations will provide a way to easily recreate SOTA numbers in a new local robotic setup, thus providing credible relative rankings between existing approaches and new work. However, these "local rankings" could vary between different setups. To resolve this issue, we build a mechanism for pooling experimental data between labs, and thus we establish a single global ranking for existing (and proposed) SOTA algorithms. Our benchmark, called Ranking-Based Robotics Benchmark (RB2), is evaluated on tasks that are inspired by the clinically validated Southampton Hand Assessment Procedures. Our benchmark was run across two different labs and reveals several surprising findings. For example, extremely simple baselines like open-loop behavior cloning outperform more complicated models (e.g., closed-loop, RNN, Offline-RL, etc.) that are preferred by the field. We hope our fellow researchers will use RB2 to improve their research's quality and rigor.
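
To make the idea of pooling "local rankings" into one global ranking concrete, here is a small sketch. The abstract does not specify the pooling mechanism, so this uses a plain Borda count as a swapped-in technique, not the method from the paper. The function name `pool_rankings` and the method labels are hypothetical placeholders, not experimental results.

```python
from collections import defaultdict

def pool_rankings(local_rankings):
    """Combine per-lab rankings (best first) into one global ranking via Borda count."""
    scores = defaultdict(int)
    for ranking in local_rankings:
        n = len(ranking)
        for position, method in enumerate(ranking):
            # The top entry earns n - 1 points, the last entry earns 0.
            scores[method] += n - 1 - position
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda m: (-scores[m], m))

# Placeholder rankings from two hypothetical labs.
lab_1 = ["method_a", "method_b", "method_c"]
lab_2 = ["method_a", "method_c", "method_b"]
global_ranking = pool_rankings([lab_1, lab_2])
```

A Borda count only uses rank positions; a pooling scheme that works from raw trial outcomes could weight labs by how many trials they ran.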

    title={RB2: Robotic Manipulation Benchmarking with a Twist},
    author={Dasari, Sudeep and Wang, Jianren and ... and Gupta, Saurabh and Held, David and Pinto, Lerrel and Pathak, Deepak and Kumar, Vikash and Gupta, Abhinav},
    journal={Thirty-fifth Conference on Neural Information Processing Systems},
Wanderlust: Online Continual Object Detection in the Real World
Jianren Wang, Xin Wang, Yue Shang-Guan, Abhinav Gupta
2021 International Conference on Computer Vision
[Project Page] [Code] [Abstract] [Bibtex]

Online continual learning from data streams in dynamic environments is a critical direction in the computer vision field. However, realistic benchmarks and fundamental studies in this line are still missing. To bridge the gap, we present a new online continual object detection benchmark with an egocentric video dataset, Objects Around Krishna (OAK). OAK adopts the KrishnaCAM videos, an egocentric video stream collected over nine months by a graduate student. OAK provides exhaustive bounding box annotations of 80 video snippets (~17.5 hours) for 105 object categories in outdoor scenes. The emergence of new object categories in our benchmark follows a pattern similar to what a single person might see in their day-to-day life. The dataset also captures the natural distribution shifts as the person travels to different places. These long-running egocentric videos provide a realistic playground for continual learning algorithms, especially in online embodied settings. We also introduce new metrics to evaluate model performance and catastrophic forgetting, and provide baseline studies for online continual object detection. We believe this benchmark will pose exciting new challenges for learning from non-stationary data in continual learning.

    title={Wanderlust: Online Continual Object Detection in the Real World},
    author={Wang, Jianren and Wang, Xin and Shang-Guan, Yue and Gupta, Abhinav},
SEMI: Self-supervised Exploration via Multisensory Incongruity
Jianren Wang*, Ziwen Zhuang*, Hang Zhao (* indicates equal contribution)
2022 IEEE International Conference on Robotics and Automation
[Project Page] [Code] [Abstract] [Bibtex]

Efficient exploration is a long-standing problem in reinforcement learning since extrinsic rewards are usually sparse or missing. A popular solution to this issue is to feed an agent with novelty signals as intrinsic rewards. In this work, we introduce SEMI, a self-supervised exploration policy that incentivizes the agent to maximize a new novelty signal: multisensory incongruity, which can be measured in two aspects, perception incongruity and action incongruity. The former represents the misalignment of the multisensory inputs, while the latter represents the variance of an agent's policies under different sensory inputs. Specifically, an alignment predictor is learned to detect whether multiple sensory inputs are aligned, the error of which is used to measure perception incongruity. A policy model takes different combinations of the multisensory observations as input and outputs actions for exploration. The variance of these actions is then used to measure action incongruity. Using both incongruities as intrinsic rewards, SEMI allows an agent to learn skills by exploring in a self-supervised manner without any external rewards. We further show that SEMI is compatible with extrinsic rewards and that it improves the sample efficiency of policy learning. The effectiveness of SEMI is demonstrated across a variety of benchmark environments, including object manipulation and audio-visual games.
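
The two incongruity signals described in the abstract above can be sketched numerically. This is an illustrative approximation, not the SEMI implementation: the function names are hypothetical, binary cross-entropy is assumed as the alignment predictor's error, and averaging the per-dimension variance is one assumed way to reduce action variance to a scalar.

```python
import numpy as np

def action_incongruity(actions):
    """Mean per-dimension variance of actions produced from different sensory input combinations."""
    actions = np.asarray(actions, dtype=float)  # shape: (num_input_combinations, action_dim)
    return float(actions.var(axis=0).mean())

def perception_incongruity(p_aligned, is_aligned):
    """Alignment predictor error, assumed here to be binary cross-entropy."""
    p = np.clip(p_aligned, 1e-7, 1 - 1e-7)
    return float(-(is_aligned * np.log(p) + (1 - is_aligned) * np.log(1 - p)))

# Actions a policy might output given, e.g., vision only vs. vision plus audio (made-up values).
acts = [[0.0], [2.0]]
r_action = action_incongruity(acts)
r_percept = perception_incongruity(0.5, 1)
intrinsic_reward = r_action + r_percept  # equal weighting is an assumption
```

Identical actions across input combinations give zero action incongruity, so under this sketch the agent is rewarded only when its sensory streams disagree about what to do.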

    title={SEMI: Self-supervised Exploration via Multisensory Incongruity},
    author={Wang, Jianren and Zhuang, Ziwen and Zhao, Hang},
    journal={IEEE International Conference on Robotics and Automation},