You can also find my articles on my Google Scholar Profile.
Learning to Manipulate from Passive Videos
Jianren Wang, Sudeep Dasari, Shubham Tulsiani, Abhinav Gupta
In Submission
[Project Page] [Code] [Abstract] [Bibtex]

How do we use videos of human-object interaction, without any action labels, to train manipulation policies? The traditional approach is to estimate actions from the video stream (e.g. by mapping human behavior to a robot) and learn a policy that imitates them (e.g. using behavior cloning). This action-prediction approach has two serious issues. First, the estimated actions are noisy, and using them in behavior learning leads to brittle policies. Second, because actions are naturally multi-modal (i.e. multiple actions lead to the same effect), learning policies that capture this multi-modality across diverse demonstrations is difficult. We provide an alternative approach: instead of learning policies that directly predict the action to reach a desired goal, we learn a distance prediction function that estimates how far one will be from the goal after taking a possible action. These distances parameterize a policy through simple greedy action selection (i.e. pick the action with the lowest predicted distance). A key advantage of our formulation is that it allows us to better exploit passive data, as shown by our experimental results.

    title={Learning to Manipulate from Passive Videos},
    author={Wang, Jianren and Dasari, Sudeep and Tulsiani, Shubham and Gupta, Abhinav},
RB2: Robotic Manipulation Benchmarking with a Twist
Sudeep Dasari, Jianren Wang, Joyce Hong, Shikhar Bahl, Yixin Lin, Austin Wang
Abitha Thankaraj, Karanbir Chahal, Berk Calli, Saurabh Gupta, David Held
Lerrel Pinto, Deepak Pathak, Vikash Kumar, Abhinav Gupta
2021 Conference on Neural Information Processing Systems
[Project Page] [Code] [Abstract] [Bibtex]

Benchmarks offer a scientific way to compare algorithms using objective performance metrics. Good benchmarks have two features: (a) they should be widely useful to many research groups, and (b) they should produce reproducible findings. In robotic manipulation research, there is a trade-off between reproducibility and broad accessibility. If the benchmark is kept restrictive (fixed hardware, objects), the numbers are reproducible but the setup becomes less general. On the other hand, a benchmark could be a loose set of protocols (e.g. the YCB object set), but the underlying variation in setups makes the results non-reproducible. In this paper, we re-imagine benchmarking for robotic manipulation as state-of-the-art algorithmic implementations, alongside the usual set of tasks and experimental protocols. The added baseline implementations provide a way to easily recreate SOTA numbers in a new local robotic setup, thus providing credible relative rankings between existing approaches and new work. However, these "local rankings" could vary between different setups. To resolve this issue, we build a mechanism for pooling experimental data between labs, and thus establish a single global ranking for existing (and proposed) SOTA algorithms. Our benchmark, called the Ranking-Based Robotics Benchmark (RB2), is evaluated on tasks inspired by the clinically validated Southampton Hand Assessment Procedure. Running the benchmark across two different labs reveals several surprising findings: for example, extremely simple baselines such as open-loop behavior cloning outperform more complicated models (e.g. closed-loop, RNN, offline-RL) that are preferred by the field. We hope our fellow researchers will use RB2 to improve the quality and rigor of their research.

    title={RB2: Robotic Manipulation Benchmarking with a Twist},
    author={Dasari, Sudeep and Wang, Jianren and Hong, Joyce and Bahl, Shikhar and Lin, Yixin and Wang, Austin and Thankaraj, Abitha and Chahal, Karanbir and Calli, Berk and Gupta, Saurabh and Held, David and Pinto, Lerrel and Pathak, Deepak and Kumar, Vikash and Gupta, Abhinav},
    journal={Thirty-fifth Conference on Neural Information Processing Systems},
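
The cross-lab pooling idea can be illustrated with a simple rank-aggregation sketch. This is only in the spirit of RB2's mechanism; the actual benchmark's pooling statistics may differ, and the lab names and algorithm lists below are invented.

```python
# Hypothetical sketch of pooling per-lab rankings into one global
# ranking, in the spirit of RB2's cross-lab aggregation (the real
# benchmark's aggregation may be defined differently).

def global_ranking(lab_rankings):
    # lab_rankings: {lab_name: [algorithms ordered best-to-worst]}.
    # Sum each algorithm's rank position across labs; lower is better.
    scores = {}
    for ranking in lab_rankings.values():
        for rank, algo in enumerate(ranking):
            scores[algo] = scores.get(algo, 0) + rank
    return sorted(scores, key=scores.get)

labs = {
    "lab_A": ["open-loop BC", "RNN", "offline-RL"],
    "lab_B": ["open-loop BC", "offline-RL", "RNN"],
}
print(global_ranking(labs))  # "open-loop BC" ranks first in both labs
```

Aggregating ranks rather than raw success rates sidesteps the fact that absolute numbers are not comparable across differently configured setups.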
Wanderlust: Online Continual Object Detection in the Real World
Jianren Wang, Xin Wang, Yue Shang-Guan, Abhinav Gupta
2021 International Conference on Computer Vision
[Project Page] [Code] [Abstract] [Bibtex]

Online continual learning from data streams in dynamic environments is a critical direction in computer vision, yet realistic benchmarks and fundamental studies along this line are still missing. To bridge the gap, we present a new online continual object detection benchmark built on an egocentric video dataset, Objects Around Krishna (OAK). OAK adopts the KrishnaCAM videos, an egocentric video stream collected over nine months by a graduate student. OAK provides exhaustive bounding-box annotations for 80 video snippets (~17.5 hours) covering 105 object categories in outdoor scenes. The emergence of new object categories in our benchmark follows a pattern similar to what a single person might see in their day-to-day life, and the dataset captures the natural distribution shifts that arise as the person travels to different places. These long-running egocentric videos provide a realistic playground for continual learning algorithms, especially in online embodied settings. We also introduce new evaluation metrics that measure both model performance and catastrophic forgetting, and we provide baseline studies for online continual object detection. We believe this benchmark will pose exciting new challenges for learning from non-stationary data in continual learning.

    title={Wanderlust: Online Continual Object Detection in the Real World},
    author={Wang, Jianren and Wang, Xin and Shang-Guan, Yue and Gupta, Abhinav},
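
One common way to quantify catastrophic forgetting on a stream like OAK is the drop from a category's best accuracy to its final accuracy. The sketch below shows that idea only; OAK's actual metrics may be defined differently, and the accuracy values are invented.

```python
# Hypothetical sketch of a forgetting measure for online continual
# learning: the gap between a category's peak and final accuracy
# over the evaluation stream (OAK's metrics may differ).

def forgetting(accuracy_over_time):
    # accuracy_over_time: one category's accuracy after each
    # evaluation step of the stream.
    return max(accuracy_over_time) - accuracy_over_time[-1]

# A category learned early in the stream, then degraded as the
# person moved to new places and the distribution shifted.
print(forgetting([0.1, 0.6, 0.7, 0.4]))  # 0.7 - 0.4
```

A model that never degrades scores zero here, so the metric isolates forgetting from raw detection quality.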
SEMI: Self-supervised Exploration via Multisensory Incongruity
Jianren Wang*, Ziwen Zhuang*, Hang Zhao (* indicates equal contribution)
2022 IEEE International Conference on Robotics and Automation
[Project Page] [Code] [Abstract] [Bibtex]

Efficient exploration is a long-standing problem in reinforcement learning, since extrinsic rewards are usually sparse or missing. A popular solution is to feed an agent novelty signals as intrinsic rewards. In this work, we introduce SEMI, a self-supervised exploration policy that incentivizes the agent to maximize a new novelty signal: multisensory incongruity, measured along two axes, perception incongruity and action incongruity. The former represents the misalignment of the multisensory inputs, while the latter represents the variance of an agent's policies under different sensory inputs. Specifically, an alignment predictor is learned to detect whether multiple sensory inputs are aligned, and its error is used to measure perception incongruity. A policy model takes different combinations of the multisensory observations as input and outputs actions for exploration; the variance of these actions is used to measure action incongruity. Using both incongruities as intrinsic rewards, SEMI allows an agent to learn skills by exploring in a self-supervised manner without any external rewards. We further show that SEMI is compatible with extrinsic rewards and improves the sample efficiency of policy learning. The effectiveness of SEMI is demonstrated across a variety of benchmark environments, including object manipulation and audio-visual games.

    title={SEMI: Self-supervised Exploration via Multisensory Incongruity},
    author={Wang, Jianren and Zhuang, Ziwen and Zhao, Hang},
    journal={IEEE International Conference on Robotics and Automation},
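
The two intrinsic reward signals can be sketched numerically as follows. Everything here is a toy stand-in: the alignment labels, the scalar actions from the per-modality policies, and the unweighted sum are invented for illustration, not taken from the paper.

```python
# Hypothetical sketch of SEMI's two intrinsic reward terms. The
# alignment predictor and per-modality policies are stand-ins, not
# the paper's learned models.
from statistics import pvariance

def perception_incongruity(predicted_alignment, true_alignment):
    # Squared error of the alignment predictor on multisensory inputs:
    # high when the modalities are misaligned in a way it cannot predict.
    return (predicted_alignment - true_alignment) ** 2

def action_incongruity(actions_per_sensory_input):
    # Variance of the actions proposed under different combinations
    # of sensory inputs: high when the policies disagree.
    return pvariance(actions_per_sensory_input)

# Toy example: vision-only, audio-only, and combined policies
# disagree, so the agent receives a positive exploration bonus.
actions = [0.2, 0.8, 0.5]
reward = perception_incongruity(0.9, 1.0) + action_incongruity(actions)
print(reward > 0)  # True
```

When the modalities agree and the policies converge, both terms vanish, so the bonus naturally decays as a state becomes familiar.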