
Exercises

  1. Submission
    1. Individual
    2. Team
    3. Deadline
  2. 👤 Individual
    1. 📨 Deliverable 1 - Bags of Visual Words [25 pts]
  3. 👥 Team
    1. Using Neural Networks for Object Detection
      1. Installation
      2. Usage
    2. 📨 Deliverable 2 - Object Localization [45 pts]
      1. Performance Expectations
    3. 📨 Deliverable 3 - Place Recognition using BoW [30 pts]
      1. Installation
      2. Usage
      3. Expectation
    4. 📨 Deliverable 4 [Optional] - Evaluating BoW Place Recognition using RANSAC [10 pts]
      1. Expectation
    5. Summary of Team Deliverables

Submission

To submit your solutions, create a folder called lab8 and push one or more files with your answers to your repository (plain text, Markdown, PDF, or any other format that is reasonably easy to read).

Individual

Please push the deliverables into your personal repository. Only typeset PDFs (e.g. using LaTeX, Word, Markdown, or similar software) are accepted.

Team

Please push the source code for the entire package to the folder lab8 of the team repository. For the tables and discussion questions, please push a PDF to the lab8 folder of your team repository.

Reminder: Please make sure that all of your final results and figures appear in your PDF submission. We do not have time to build and run everyone’s code to check every individual result.

Deadline

Deadline: the VNAV staff will clone your repository on November 1st at 1 PM EDT.

👤 Individual

📨 Deliverable 1 - Bags of Visual Words [25 pts]

Please answer the following questions; the complete writeup should be between 1/2 and 1 page.

  1. Explain which components in a basic BoW-based place recognition system determine the robustness of the system to illumination and 3D viewpoint changes. Why? Aim for 75-125 words, and try to give specific examples.
    • Hint: You may find it enlightening to read the DBoW paper on the subject, though you should be able to answer based on this week’s lectures.
  2. Explain the purpose of the Inverse Document Frequency (IDF) term in tf-idf. What would happen without this term and why? Aim for 75-125 words.
    • Hint: Consider the case where a few words are very common across almost all documents/images. Also, you can check for resources about IDF online (such as this one) if you would like to build your intuition. A reference formulation is also given after this list.
  3. How does the vocabulary size in BoW-based systems affect the performance of the system, particularly in terms of computational cost and precision/recall? Aim for 75-125 words.
    • Hint: For precision, how would adding words to the vocabulary make it easier/harder to recognize when 2 documents/images are very similar or different? Likewise for recall?
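For reference only (the notation below is ours, not from the handout), one common formulation of the tf-idf weight of visual word i in image d is, in LaTeX:

% n_{id}: occurrences of word i in image d, n_d: total words in image d,
% N: number of images, N_i: number of images containing word i.
\[
  w_{id} \;=\; \underbrace{\frac{n_{id}}{n_d}}_{\text{tf}} \;\cdot\; \underbrace{\log \frac{N}{N_i}}_{\text{idf}}
\]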

👥 Team

Using Neural Networks for Object Detection

YOLO is a Convolutional Neural Network that detects objects of multiple classes. It is based on the paper “You Only Look Once: Unified, Real-Time Object Detection”. Every detected object is marked by a bounding box, and the confidence of each detection is given as a probability (more details can be found on the YOLOv3 page). Since we are using ROS for most of our software, we will use the darknet_ros repository.

Installation

This lab assumes you already have a catkin workspace set up, with ROS, OpenCV, and GTSAM installed from the earlier labs. If you have not, please refer to the earlier labs to set up a vnav_ws.

To install the darknet_ros package, follow the installation procedure in the Readme of the repo. You can use the weights that are automatically downloaded when the package is built:

# Download the repo:
cd vnav_ws/src
git clone --recursive https://github.com/Schmluk/darknet_ros.git
cd ../

# Build the package.
# NOTE: This will automatically use the GPU if available, otherwise CPU (GPU is recommended for performance).
catkin build darknet_ros

# Make sure the installation is correct:
catkin build darknet_ros --no-deps --verbose --catkin-make-args run_tests

You should see an image with two bounding boxes indicating that there is a person (albeit incorrectly).

Usage

Make sure you read the Readme in the repo, in particular the Nodes section which introduces the parameters used by YOLO and the ROS topics where the output is published.

Now, download the following rosbag, Sequence freiburg3_teddy, taken from the TUM RGB-D dataset.

Next, edit ~/vnav_ws/src/darknet_ros/darknet_ros/config/ros.yaml so that it subscribes to the RGB topic of the dataset you are using. For example, for sequence freiburg3_teddy, change ros.yaml as follows:

subscribers:
  camera_reading:
    topic: /camera/rgb/image_color
    queue_size: 1

Now, open two terminals. In one, run YOLO:

roslaunch darknet_ros darknet_ros.launch

In the other terminal, play the rosbag you downloaded (e.g., the freiburg3_teddy bag):

rosbag play PATH/TO/ROSBAG/DOWNLOADED

Great! Now you should be seeing YOLO detecting objects in the scene!
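If you want to inspect the detections programmatically (which will be useful for Deliverable 2), a minimal listener might look like the sketch below. This is only a sketch under the assumption that your configuration keeps the default darknet_ros output topic /darknet_ros/bounding_boxes; verify the topic against the publishers section of your ros.yaml, and note that the node name is an arbitrary choice.

#include <ros/ros.h>
#include <darknet_ros_msgs/BoundingBoxes.h>

// Print every detection and the center pixel of its bounding box.
void boundingBoxCallback(const darknet_ros_msgs::BoundingBoxes& msg) {
  for (const auto& box : msg.bounding_boxes) {
    const double u = 0.5 * (box.xmin + box.xmax);  // center column
    const double v = 0.5 * (box.ymin + box.ymax);  // center row
    ROS_INFO("Detected %s (p=%.2f) at pixel (%.1f, %.1f)",
             box.Class.c_str(), box.probability, u, v);
  }
}

int main(int argc, char** argv) {
  ros::init(argc, argv, "bounding_box_listener");
  ros::NodeHandle nh;
  // Default darknet_ros output topic; adjust if your ros.yaml differs.
  ros::Subscriber sub =
      nh.subscribe("/darknet_ros/bounding_boxes", 10, boundingBoxCallback);
  ros::spin();
  return 0;
}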

📨 Deliverable 2 - Object Localization [45 pts]

Our goal for this exercise is to localize the teddy bear that is at the center of the scene in the freiburg3_teddy dataset. To do so, we will use YOLO detections to know where the teddy bear is. With the bounding box of the teddy bear, we can calculate a crude approximation of the bear’s 3D position by using the center pixel of the bounding box. If we accumulate enough 2D measurements, we can formulate a least-squares problem in GTSAM to triangulate the 3D position of the teddy bear.

For that, we will need to perform the following steps:

  1. The freiburg3_teddy rosbag provides the ground-truth transformation of the camera with respect to the world. Subscribe to the tf topic in ROS to receive this transform.
  2. In parallel, you should be able to get the results from darknet_ros (YOLO) by either making the node subscribe to the stream of images, or using the Action message that the package offers.
  3. Use YOLO to detect the bounding box around the teddy bear.
  4. Extract the center pixel of the bounding box.
  5. Formulate a GTSAM problem to estimate the 3D position corresponding to the center pixel of the bounding box (a rough approximation of the bear’s position). You will need multiple GenericProjectionFactors to fully constrain the 3D position of the teddy bear. Recall the GTSAM exercise where you solved a toy Bundle Adjustment problem and use the same factors to build this problem. Note that the camera poses are now given to you as ground truth, so you may want to add priors on the poses using the ground-truth values from the tf topic (a minimal sketch is given after this list).
  6. Solve the problem in GTSAM. You can re-use previous code from lab_7.
  7. Plot the 3D position of the teddy bear in Rviz.
  8. Also plot the trajectory of the camera. You can re-use previous code from lab_7.
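To make step 5 concrete, here is a minimal sketch of the factor-graph construction, assuming you have already collected the ground-truth camera poses and the bounding-box centers. The function name, the noise sigmas, and the initial guess are illustrative assumptions, and the intrinsics shown are the commonly quoted freiburg3 values; verify them against the bag’s /camera/rgb/camera_info and choose your own noise models.

#include <vector>
#include <boost/make_shared.hpp>
#include <gtsam/geometry/Cal3_S2.h>
#include <gtsam/geometry/Point2.h>
#include <gtsam/geometry/Point3.h>
#include <gtsam/geometry/Pose3.h>
#include <gtsam/inference/Symbol.h>
#include <gtsam/linear/NoiseModel.h>
#include <gtsam/nonlinear/LevenbergMarquardtOptimizer.h>
#include <gtsam/nonlinear/NonlinearFactorGraph.h>
#include <gtsam/nonlinear/Values.h>
#include <gtsam/slam/PriorFactor.h>
#include <gtsam/slam/ProjectionFactor.h>

using gtsam::symbol_shorthand::L;  // landmark (teddy bear)
using gtsam::symbol_shorthand::X;  // camera poses

gtsam::Point3 triangulateTeddy(const std::vector<gtsam::Pose3>& world_T_cam,
                               const std::vector<gtsam::Point2>& centers) {
  // Intrinsics (fx, fy, skew, cx, cy): commonly quoted freiburg3 values.
  auto K = boost::make_shared<gtsam::Cal3_S2>(535.4, 539.2, 0.0, 320.1, 247.6);

  gtsam::NonlinearFactorGraph graph;
  gtsam::Values initial;

  // Placeholder sigmas: tight prior on ground-truth poses, loose on the crude bbox center.
  auto pose_noise = gtsam::noiseModel::Isotropic::Sigma(6, 1e-3);
  auto pixel_noise = gtsam::noiseModel::Isotropic::Sigma(2, 10.0);

  for (size_t i = 0; i < centers.size(); ++i) {
    // Pin each camera pose to its ground-truth value from the tf topic.
    graph.emplace_shared<gtsam::PriorFactor<gtsam::Pose3>>(X(i), world_T_cam[i], pose_noise);
    // Reprojection constraint: landmark L(0) projects to the bounding-box center.
    graph.emplace_shared<
        gtsam::GenericProjectionFactor<gtsam::Pose3, gtsam::Point3, gtsam::Cal3_S2>>(
        centers[i], pixel_noise, X(i), L(0), K);
    initial.insert(X(i), world_T_cam[i]);
  }
  // Rough initial guess for the landmark: 2 m in front of the first camera.
  initial.insert(L(0), world_T_cam.front().transformFrom(gtsam::Point3(0.0, 0.0, 2.0)));

  gtsam::Values result = gtsam::LevenbergMarquardtOptimizer(graph, initial).optimize();
  return result.at<gtsam::Point3>(L(0));
}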

Since there are many ways to solve this problem, and since we have reached a point where you should be comfortable designing your own ROS callbacks and general code architecture, we leave this problem open to your own implementation style. Nonetheless, we do provide some minimal starter code and hints (see deliverable_2.cpp, along with helper_functions.hpp). Please feel free to post in Piazza or reach out via email or office hours if you need some advice on architecting a solution.

When evaluating this deliverable we will not focus only on the end result (although it will count), but on your implementation, as well as your assumptions and considerations. Therefore, we ask you to write a short summary of the assumptions, design choices, and considerations you made in order to solve this problem. There is no right or wrong answer, as many approaches would reach a similar result, but we will look at the principles you apply when solving this problem. Consider this deliverable as preparation for what we will look for in the final project. Aim for around 250 words, or half a page.

Performance Expectations

Your final RVIZ figure should look something like the following image. In particular, try to show both the trajectory of the camera (green), the camera poses for which you got a good detection of the teddy bear (red arrows), and a geometry_msgs::PointStamped for the teddy bear’s estimated location (purple sphere). Note that the size of the sphere does not matter as long as it is visible, although you are welcome to compute the covariance of your estimate and draw a PoseWithCovariance if you would like the size to represent the covariance.
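As a rough illustration, publishing the estimate for RViz could be as simple as the sketch below. The topic name, the latched publisher, and the "world" frame id are our assumptions; use whatever frame and topic names fit the rest of your code.

#include <ros/ros.h>
#include <geometry_msgs/PointStamped.h>

// Publish the triangulated teddy-bear position once (latched), so that an
// RViz PointStamped display can pick it up at any time.
void publishTeddyPosition(ros::NodeHandle& nh, double x, double y, double z) {
  static ros::Publisher pub =
      nh.advertise<geometry_msgs::PointStamped>("teddy_position", 1, /*latch=*/true);
  geometry_msgs::PointStamped msg;
  msg.header.stamp = ros::Time::now();
  msg.header.frame_id = "world";  // assumption: matches your ground-truth frame
  msg.point.x = x;
  msg.point.y = y;
  msg.point.z = z;
  pub.publish(msg);
}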

Deliverable 2

📨 Deliverable 3 - Place Recognition using BoW [30 pts]

DBoW2 is a state-of-the-art algorithm for place recognition (loop closure). It is based on a Bag of Words technique (details in their paper).

Place recognition is a common module in a SLAM pipeline and is often run as a parallel process alongside the actual Visual Odometry pipeline. Whenever a place is recognized as having been visited previously, this module computes the relative pose between the camera that took the first image of the scene and the current camera. The SLAM system then fuses this result with the visual odometry (typically by adding a new factor to the factor graph). Note that the module might fail to recognize a scene, which can result in a lack of loop closures, or, what is worse, provide wrong matches.
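As a rough illustration of how a BoW database is typically built and queried, here is a sketch in the style of the DBoW2 demo; the lab's deliverable_3.cpp wraps this differently, and the vocabulary path and image path below are placeholders for the files you actually use.

#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <DBoW2/DBoW2.h>  // provides the OrbVocabulary / OrbDatabase typedefs

// DBoW2 expects each ORB descriptor as its own single-row cv::Mat.
std::vector<cv::Mat> toDescriptorVector(const cv::Mat& descriptors) {
  std::vector<cv::Mat> out;
  out.reserve(descriptors.rows);
  for (int i = 0; i < descriptors.rows; ++i) out.push_back(descriptors.row(i));
  return out;
}

int main() {
  OrbVocabulary voc("vocabulary/ORBvoc.yml.gz");  // placeholder: use the file you downloaded
  OrbDatabase db(voc, /*use_direct_index=*/false, 0);

  // For each incoming image: extract ORB features, query past images, then add the image.
  cv::Mat image = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);  // placeholder input
  auto orb = cv::ORB::create();
  std::vector<cv::KeyPoint> keypoints;
  cv::Mat descriptors;
  orb->detectAndCompute(image, cv::noArray(), keypoints, descriptors);

  const std::vector<cv::Mat> features = toDescriptorVector(descriptors);
  DBoW2::QueryResults results;
  db.query(features, results, /*max_results=*/5);
  // Each result carries an Id and a Score; a loop-closure candidate is a past
  // image whose Score exceeds the chosen score_threshold.
  db.add(features);
  return 0;
}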

For this exercise, we ask you to assess the quality of the loop closures extracted by DBoW2.

To start, you should download the MH_01_easy sequence of the EuRoC dataset (available here) and place it under the path /home/$USER/datasets/vnav/ (i.e., such that /home/$USER/datasets/vnav/MH_01_easy exists).

Installation

Based on the workspace you built up for vnav, the following steps should work:

cd ~/vnav_ws/src

# Install dbow2 as catkin package (dbow2_catkin):
git clone git@github.com:MIT-SPARK/dbow2_catkin.git

# Download the demo code already available for deliverable 3 to your workspace:
https://github.com/MIT-SPARK/VNAV-labs/tree/main/lab8 # Download manually.

cd lab8

# To run dbow2, download the ORB vocabulary from GDrive and place it in the 'lab8' folder:
https://drive.google.com/file/d/1dEepkrQUsSgDZPOWhvtOO1XRTA4Tlkmv/view # Download manually.
unzip vocabulary.zip
rm vocabulary.zip


# Implement the deliverables and build:
catkin build lab_8

Usage

Once implemented and built, you can run the deliverable using (replace the EuRoC path as appropriate):

roslaunch lab_8 loop_closure.launch path_to_euroc:=/home/$USER/datasets/vnav/MH_01_easy score_threshold:=0 inlier_threshold:=0

Expectation

For this deliverable, you will need to perform the following steps:

  1. Implement the missing portions of the code in deliverable_3.cpp (labeled Part 3.1 to 3.3) to run DBoW on the EuRoC dataset.
  2. After you have implemented these, run the code as explained above. You should now see the trajectory in blue and all proposed loop-closures as in the picture below:

Deliverable 3

The loop closures are colored according to the DBoW score, from lowest (red) to highest (green). What do you observe?

  1. Run the code again, specifying different values for score_threshold. What is a good value? What do you observe?

📨 Deliverable 4 [Optional] - Evaluating BoW Place Recognition using RANSAC [10 pts]

In this deliverable, we will use the tracking code you already implemented in lab5 for geometric verification of the proposed loop closures. This is an extension of the previous deliverable and can be implemented in deliverable_3.cpp.
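A minimal sketch of such a geometric check is given below, assuming you already have keypoints and descriptor matches between the query image and the loop-closure candidate from your lab5-style feature matching; the function name and the RANSAC threshold are illustrative choices, and how the inlier count maps to inlier_threshold is up to you.

#include <vector>
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>

int countRansacInliers(const std::vector<cv::KeyPoint>& kps1,
                       const std::vector<cv::KeyPoint>& kps2,
                       const std::vector<cv::DMatch>& matches) {
  if (matches.size() < 8) return 0;  // too few correspondences for the 8-point algorithm

  std::vector<cv::Point2f> pts1, pts2;
  for (const auto& m : matches) {
    pts1.push_back(kps1[m.queryIdx].pt);
    pts2.push_back(kps2[m.trainIdx].pt);
  }

  // Fit a fundamental matrix with RANSAC; correspondences consistent with a
  // single epipolar geometry are marked as inliers in the mask.
  std::vector<uchar> inlier_mask;
  cv::findFundamentalMat(pts1, pts2, cv::FM_RANSAC, 3.0, 0.99, inlier_mask);
  return cv::countNonZero(inlier_mask);
}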

Expectation

For this deliverable, you will need to perform the following steps:

  1. Implement the missing portions of the code in deliverable_3.cpp (labeled Part 3.4) to add geometric verification of the proposed loop closures.
  2. After you have implemented these, run the code as explained above. By setting inlier_threshold:=1, you should now see the trajectory in blue and all proposed loop-closures as in the picture below:

Deliverable 4

The loop closures are colored according to the number of inliers, from lowest (red) to highest (green). What do you observe?

  1. Run the code again, specifying different values for inlier_threshold. What is a good value? What do you observe?

  2. Why does the number of inliers indicate that the loop closure is geometrically sound? How does it compare to the DBoW score?

Final Detected Loop Closures: a possible set of final detected loop closures. Are these good loop closures?

Summary of Team Deliverables

  1. A 1/2 page summary of the implementation and the assumptions made in your Object Localization code, along with a final position estimate of the teddy bear in the world reference frame.
  2. An image showing the trajectory of the robot and the final estimated location of the teddy bear in RVIZ.
  3. Two images, one showing all detected initial loop closures, and one showing all selected final loop closures. Write a short paragraph listing your final score_threshold and your observations when adjusting the score.
  4. [Optional] Two images, one showing all loop closures colored by number of inliers, and one showing the selected final loop closures. Write a short paragraph listing your final value and observations when adjusting the inlier threshold. Explain why a high number of inliers indicates that a loop closure is geometrically sound and how this compares to the DBoW score.

Copyright © 2018-2022 MIT. This work is licensed under CC BY 4.0