BioVL

Research Results

Research Project Overview

We are conducting research to support the recording and analysis of biochemical experiments. Aiming to improve the reproducibility of experiments and facilitate effective knowledge sharing among researchers, we are working on constructing a dataset that integrates video and language, as well as developing technologies that leverage this dataset.

So far, we have proposed a dataset that integrates experimental videos and protocols, developed an automatic protocol generation method, explored object detection techniques utilizing Micro QR codes, conducted research on error detection in procedural execution, and built a task support system. Through these studies, we are pioneering new possibilities for recording and analyzing biochemical experiments.

Research Details

Egocentric Biochemical Video-and-Language Dataset

Taichi Nishimura, Kojiro Sakoda, Atsushi Hashimoto, Yoshitaka Ushiku, Natsuko Tanaka, Fumihito Ono, Hirotaka Kameko, Shinsuke Mori

CLVL (ICCV workshop), 2021.


This paper proposes a novel biochemical video-and-language (BioVL) dataset, which consists of experimental videos, corresponding protocols, and annotations of the alignment between events in the videos and instructions in the protocols. The key strength of the dataset is its user-oriented design of data collection: we envision that biochemical researchers will casually record videos and share them so that other researchers can replicate the experiments in the future. To minimize the burden of video recording, we adopt unedited first-person video as the visual source. As a result, we collected 16 videos covering four protocols, with a total length of 1.6 hours. In our experiments, we conduct two zero-shot video-and-language tasks on the BioVL dataset. The results show that there is still large room for improvement for practical use, even when utilizing a state-of-the-art pre-trained video-and-language joint embedding model. We plan to release the BioVL dataset; to our knowledge, this work is the first attempt to release a biochemical video-and-language dataset.
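
To make the zero-shot setting concrete, here is a minimal sketch, assuming a CLIP-style joint embedding model whose video and text encoders map into a shared space (the encoders themselves are placeholders, not the specific model evaluated in the paper): each protocol instruction is matched to the video segment with the highest cosine similarity.

    import numpy as np

    def zero_shot_alignment(clip_embeddings, instruction_embeddings):
        # clip_embeddings: (num_clips, dim) array from a pretrained video encoder.
        # instruction_embeddings: (num_steps, dim) array from the paired text encoder.
        v = clip_embeddings / np.linalg.norm(clip_embeddings, axis=1, keepdims=True)
        t = instruction_embeddings / np.linalg.norm(instruction_embeddings, axis=1, keepdims=True)
        similarity = t @ v.T              # cosine similarities, shape (steps, clips)
        return similarity.argmax(axis=1)  # best-matching clip index for each step

    # Toy usage with random vectors standing in for real encoder outputs.
    clips = np.random.randn(120, 512)     # one embedding per video segment
    steps = np.random.randn(8, 512)       # one embedding per protocol instruction
    print(zero_shot_alignment(clips, steps))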

BioVL2 Dataset: Language Annotation of Egocentric Experiment Videos in the Biochemical Field

Taichi Nishimura, Kojiro Sakoda, Atsushi Ushiku, Atsushi Hashimoto, Natsuko Okuda, Fumihito Ono, Hirotaka Kameko, Shinsuke Mori

Journal of Natural Language Processing (自然言語処理), Vol. 29, No. 4, 2022.


This paper proposes the BioVL2 dataset, a dataset of egocentric experiment videos in the biochemical field. The BioVL2 dataset consists of 32 videos (2.5 hours in total), with eight videos recorded for each of four basic biochemical experiments. Each video is linked to a protocol and carries two types of language annotation: (1) alignment between vision and language, and (2) bounding-box annotation of the objects that appear in the protocol. As an application of the dataset, we tackled the task of automatically generating protocols from experiment videos. Quantitative and qualitative evaluations confirmed that the developed method generates more appropriate protocols than a weak baseline that simply outputs the names of the objects visible in each frame as a protocol. The BioVL2 dataset will be released for research purposes only.
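
The two annotation types can be pictured with a small sketch like the following; the class and field names are assumptions made for illustration and do not reflect the released annotation format.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class StepAlignment:
        # Links one protocol step to the video interval (in seconds) where it is performed.
        step_index: int
        start_sec: float
        end_sec: float

    @dataclass
    class ObjectBox:
        # Bounding box, in pixels, for an object mentioned in the protocol.
        frame_sec: float
        label: str               # e.g. "pipette", "microtube"
        x: int
        y: int
        width: int
        height: int

    @dataclass
    class AnnotatedVideo:
        # One experiment video with its protocol and the two annotation types.
        video_id: str
        protocol_steps: List[str]
        alignments: List[StepAlignment] = field(default_factory=list)
        object_boxes: List[ObjectBox] = field(default_factory=list)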

BioVL-QR: Egocentric Biochemical Vision-and-Language Dataset Using Micro QR Codes

Tomohiro Nishimoto, Taichi Nishimura, Koki Yamamoto, Keisuke Shirai, Hirotaka Kameko, Yuto Haneji, Tomoya Yoshida, Keiya Kajimura, Taiyu Cui, Chihiro Nishiwaki, Eriko Daikoku, Natsuko Okuda, Fumihito Ono, Shinsuke Mori

arXiv, 2025.


This paper introduces BioVL-QR, a biochemical vision-and-language dataset comprising 23 egocentric experiment videos, the corresponding protocols, and vision-and-language alignments. A major challenge in understanding biochemical videos is detecting equipment, reagents, and containers because of the cluttered environment and indistinguishable objects. Previous studies assumed manual object annotation, which is costly and time-consuming. To address the issue, we focus on Micro QR Codes. However, detecting objects using only Micro QR Codes is still difficult due to blur and occlusion caused by object manipulation. To overcome this, we propose an object labeling method combining a Micro QR Code detector with an off-the-shelf hand object detector. As an application of the method and BioVL-QR, we tackled the task of localizing the procedural steps in an instructional video. The experimental results show that using Micro QR Codes and our method improves biochemical video understanding.
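
A rough sketch of the kind of combination described above, with detect_micro_qr and detect_hand_held_objects as hypothetical stand-ins for the Micro QR Code detector and the off-the-shelf hand object detector (an illustration of the idea, not the paper's implementation): labels decoded from Micro QR Codes are remembered and propagated to the overlapping hand-held object regions in frames where the codes themselves are blurred or occluded.

    def iou(box_a, box_b):
        # Boxes are (x1, y1, x2, y2); returns intersection-over-union.
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def label_hand_held_objects(frames, detect_micro_qr, detect_hand_held_objects):
        # Remember the last box decoded for each reagent or piece of equipment.
        last_seen = {}  # label -> most recently decoded QR box
        labeled = []
        for frame in frames:
            for qr_box, label in detect_micro_qr(frame):   # may be empty when blurred/occluded
                last_seen[label] = qr_box
            frame_labels = []
            for obj_box in detect_hand_held_objects(frame):
                # Assign the label whose last decoded QR box overlaps this object most.
                best = max(last_seen.items(), key=lambda kv: iou(obj_box, kv[1]), default=None)
                if best is not None and iou(obj_box, best[1]) > 0.1:
                    frame_labels.append((obj_box, best[0]))
            labeled.append(frame_labels)
        return labeled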

EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos Referring to Procedural Texts

Yuto Haneji, Taichi Nishimura, Hirotaka Kameko, Keisuke Shirai, Tomoya Yoshida, Keiya Kajimura, Koki Yamamoto, Taiyu Cui, Tomohiro Nishimoto, Shinsuke Mori

arXiv, 2024.


Mistake action detection is crucial for developing intelligent archives that detect workers’ errors and provide feedback. Existing studies have focused on visually apparent mistakes in free-style activities, resulting in video-only approaches to mistake detection. However, in text-following activities, models cannot determine the correctness of some actions without referring to the texts. Additionally, current mistake datasets rarely use procedural texts for video recording except for cooking. To fill these gaps, this paper proposes the EgoOops dataset, where egocentric videos record erroneous activities when following procedural texts across diverse domains. It features three types of annotations: video-text alignment, mistake labels, and descriptions for mistakes. We also propose a mistake detection approach, combining video-text alignment and mistake label classification to leverage the texts. Our experimental results show that incorporating procedural texts is essential for mistake detection.
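
As a schematic of the two-stage approach described above (the component names are placeholders assumed for illustration, not taken from the released code): each clip is first aligned to a step of the procedural text, and the mistake classifier then sees both the clip and the aligned step, so that errors which are only detectable with reference to the text become detectable.

    from typing import Callable, List, Tuple

    def detect_mistakes(
        clips: List[object],
        steps: List[str],
        align_clips_to_steps: Callable[[List[object], List[str]], List[int]],
        classify_mistake: Callable[[object, str], Tuple[str, float]],
    ) -> List[dict]:
        # Stage 1: video-text alignment gives each clip the index of its procedural step.
        step_indices = align_clips_to_steps(clips, steps)
        results = []
        for clip, idx in zip(clips, step_indices):
            # Stage 2: mistake classification is conditioned on the aligned step text,
            # so text-dependent errors (e.g. wrong reagent amount) can be caught.
            label, score = classify_mistake(clip, steps[idx])
            results.append({"step": idx, "label": label, "score": score})
        return results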

A Multimodal Task-Support System Using Egocentric Videos

Keiya Kajimura, Taichi Nishimura, Yuto Haneji, Koki Yamamoto, Taiyu Cui, Hirotaka Kameko, Shinsuke Mori

The 30th Annual Meeting of the Association for Natural Language Processing (NLP2024), 2024.


In situations where a worker follows a written procedure, such as experiments or cooking, being able to check uncertain or unclear steps as video is expected to be effective in improving the reproducibility of the work. Egocentric video is also advantageous as the footage that workers consult, because it lets them see the recorded worker's gaze and fine hand movements. Aiming to improve the reproducibility of manual work in general, this study proposes a multimodal task-support system based on egocentric videos that accepts both text and voice input. In our experiments, we examine the effect of the system on workload and task completion quality, and discuss whether the system contributes to reducing the workload and improving the quality of the completed work.
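
As a rough illustration of the retrieval such a system needs at its core, the sketch below maps a typed (or speech-transcribed) query embedding to the most similar procedural step and returns the egocentric video segment aligned to that step; the embedding inputs and the clip layout are assumptions for this example, not the implemented system.

    import numpy as np

    def retrieve_step_clip(query_embedding, step_embeddings, step_clips):
        # query_embedding: (dim,) vector for the text or transcribed-speech query.
        # step_embeddings: (num_steps, dim) vectors for the procedure's steps.
        # step_clips: per-step (video_path, start_sec, end_sec) from the video-text alignment.
        q = query_embedding / np.linalg.norm(query_embedding)
        s = step_embeddings / np.linalg.norm(step_embeddings, axis=1, keepdims=True)
        best = int(np.argmax(s @ q))
        return best, step_clips[best]

    # Toy usage: three steps, random embeddings standing in for a real text encoder.
    steps = np.random.randn(3, 256)
    clips = [("exp01.mp4", 0.0, 12.5), ("exp01.mp4", 12.5, 40.0), ("exp01.mp4", 40.0, 63.0)]
    print(retrieve_step_clip(np.random.randn(256), steps, clips))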