Domain Adaptation for Visual Understanding by Richa Singh & Mayank Vatsa & Vishal M. Patel & Nalini Ratha

ISBN: 9783030306717
Publisher: Springer International Publishing


where N represents the total number of video frames, and start and end denote the start and end points of the local video segment; notice that 1 ≤ start ≤ end ≤ N. We use average pooling to aggregate the features over this time span, and then apply L2 normalization to rescale the pooled vision features.
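A minimal sketch of this pooling step, assuming per-frame features are stored in an (N, D) NumPy array and that start and end are 1-indexed frame positions; the function name is illustrative, not the chapter's actual code:

```python
import numpy as np

def pool_local_segment(frame_features, start, end):
    """Average-pool per-frame features over [start, end], then L2-normalize."""
    segment = frame_features[start - 1:end]            # frames inside the local time span
    pooled = segment.mean(axis=0)                       # average pooling over time
    return pooled / (np.linalg.norm(pooled) + 1e-12)    # L2 normalization after pooling
```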

Simultaneously, putting both local video features and context video features into the model only weakly helps the model learn the temporal relation between the video segment and the entire video. To model more temporal information that indicates whether the video segment matches the language query, we add a temporal feature that represents the time span to the video features. The temporal features are also normalized (to [0, 1]) so that they are on the same numerical scale as the vision features. Finally, we concatenate the video context features, the video local features, and the temporal features to construct the input video representation.
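A sketch of how such an input representation could be assembled, again assuming an (N, D) array of per-frame features; encoding the time span as the two normalized endpoints (start/N, end/N) is an assumption for illustration, as are the function names:

```python
import numpy as np

def build_video_representation(frame_features, start, end):
    """Concatenate context, local, and temporal features into one input vector."""
    N = frame_features.shape[0]

    def pool(feats):
        pooled = feats.mean(axis=0)                       # average pooling over time
        return pooled / (np.linalg.norm(pooled) + 1e-12)  # L2 normalization

    context = pool(frame_features)                        # whole-video context feature
    local = pool(frame_features[start - 1:end])           # local segment feature
    temporal = np.array([start / N, end / N], dtype=np.float32)  # time span rescaled to [0, 1]
    return np.concatenate([context, local, temporal])
```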

Since a video consists of a sequence of still images, we can transfer knowledge learned from image datasets to learn video information. We use a model pretrained on ImageNet [10] to extract appearance features from the video dataset. Appearance information captures the objects and other attributes in still video frames. In video recognition, motion features, typically in the form of optical flow [17], are also widely used to recognize actions. To model the motion information of videos, we use a video recognition network [25] to extract motion features. In our experiments, we construct the vision features separately from the appearance and motion features: two retrieval models are trained, one with each feature type, and their outputs are aggregated with late fusion.
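Late fusion here can be as simple as a weighted combination of the two models' matching scores. The sketch below assumes each model has already produced a score for every candidate segment given the language query; the function name and the equal default weight are illustrative assumptions:

```python
import numpy as np

def late_fusion(scores_appearance, scores_motion, weight=0.5):
    """Aggregate the scores of the two independently trained retrieval models.

    scores_appearance / scores_motion: matching scores between the language
    query and each candidate video segment, one array per model.
    weight=0.5 reduces the fusion to a plain average.
    """
    return weight * np.asarray(scores_appearance) + (1.0 - weight) * np.asarray(scores_motion)
```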

The video embedding network consists of two fully connected layers with ReLU activations. The first fully connected layer of each video embedding network is shared to reduce the number of model parameters.
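A minimal PyTorch sketch of such an embedding branch; the class name, the layer dimensions, and the mechanism of passing in a shared first layer are chosen purely for illustration and are not the chapter's actual implementation:

```python
import torch
import torch.nn as nn

class VideoEmbeddingNet(nn.Module):
    """Two fully connected layers with a ReLU after the first.

    Passing an existing first layer in via shared_fc1 shares its parameters
    across several embedding branches, reducing the number of model parameters.
    """
    def __init__(self, in_dim, hidden_dim, out_dim, shared_fc1=None):
        super().__init__()
        self.fc1 = shared_fc1 if shared_fc1 is not None else nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# Illustrative usage: two embedding branches sharing the first fully connected layer.
branch_a = VideoEmbeddingNet(in_dim=2048, hidden_dim=1024, out_dim=512)
branch_b = VideoEmbeddingNet(in_dim=2048, hidden_dim=1024, out_dim=512, shared_fc1=branch_a.fc1)
```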


