SIDQL: An Efficient Keyframe Extraction and Motion Reconstruction Framework in Motion Capture

Xuling ZHANG xzhang659@connect.hkust-gz.edu.cn 1234-5678-9012 Ziru ZHANG zzhang758@connect.hkust-gz.edu.cn The Hong Kong University of Science and Technology (Guangzhou)GuangzhouChina511453 Yuyang WANG The Hong Kong University of Science and Technology (Guangzhou)GuangzhouChina yuyangwang@hkust-gz.edu.cn Lik-Hang LEE The Hong Kong Polytechnic UniversityHong KongChina lik-hang.lee@polyu.edu.hk  and  Pan HUI The Hong Kong University of Science and Technology (Guangzhou)GuangzhouChina panhui@ust.hk
(2024)
Abstract.

Metaverse, which integrates the virtual and physical worlds, has emerged as an innovative paradigm for changing people’s lifestyles. Motion capture has become a reliable approach to achieve seamless synchronization of the movements between avatars and human beings, which plays an important role in diverse Metaverse applications. However, due to the continuous growth of data, current communication systems face a significant challenge of meeting the demand of ultra-low latency during application. In addition, current methods also have shortcomings when selecting keyframes, e.g., relying on recognizing motion types and artificially selected keyframes. Therefore, the utilization of keyframe extraction and motion reconstruction techniques could be considered a feasible and promising solution. In this work, a new motion reconstruction algorithm is designed in a spherical coordinate system involving location and velocity information. Then, we formalize the keyframe extraction problem into an optimization problem to reduce the reconstruction error. Using Deep Q-Learning (DQL), the Spherical Interpolation based Deep Q-Learning (SIDQL) framework is proposed to generate proper keyframes for reconstructing the motion sequences. We use the CMU database to train and evaluate the framework. Our scheme can significantly reduce the data volume and transmission latency compared to various baselines while maintaining a reconstruction error of less than 0.09 when extracting five keyframes.

Keyframe Extraction, Motion Reconstruction, Deep Q-Learning, Spherical Interpolation.
copyright: acmlicensedjournalyear: 2024doi: XXXXXXX.XXXXXXXccs: Computing methodologies Reinforcement learningccs: Computing methodologies Motion processingccs: Computing methodologies Motion captureccs: Computing methodologies 3D imaging

1. Introduction

The Metaverse, a virtual realm that coexists alongside the tangible world, is forged by a synergy of technologies, including artificial intelligence (AI), virtual reality (VR), augmented reality (AR), and human-machine interaction (HMI). Esteemed for its high application potential, it finds extensive use across various domains such as gaming, education, healthcare, and defense (Lee et al., 2021). Moreover, the Metaverse is revolutionizing human communication modalities. Through VR and AR devices, users engage in a deeply immersive environment that surpasses the engagement offered by traditional, two-dimensional settings. This advancement fosters enhanced communicative efficiency both amongst users and between users and machines (Zhao et al., 2022). Within the Metaverse, avatars stand as a pivotal concept, with each user adopting a digital persona for virtual interactions. These avatars, three-dimensional digital entities endowed with anthropomorphic traits via convergent technologies, serve as the principal conduit for an individual’s perception of virtual temporality within the Metaverse. Looking ahead, users will possess distinctive avatars, leveraging these virtual proxies to mirror real-life movements within the exploratory confines of the virtual domain (Lam et al., 2022; Chen et al., 2023).

Motion capture technology is the predominant method for synchronizing avatars’ movements in the Metaverse. By capturing the movements of actors or performers, motion capture enables the creation of avatars that mimic their actions, transporting users into a world where they can seamlessly interact with these digital representations. With motion capture techniques, avatars enable users to embark on immersive journeys, participate in social interactions, and even engage in exhilarating 3D games, which plays a pivotal role in the development of the Metaverse (Vlahovic et al., [n. d.]). In addition, the combination of motion capture, 3D animations, and 3D games presents an unprecedented opportunity to bridge the gap between the real and digital worlds, pushing the boundaries of imagination and unleashing a new era of interactive and immersive experiences (Li et al., 2020).

The naive approach of synchronizing an avatar’s movements in the Metaverse requires transmitting a large amount of motion data in a short time, unavoidably causing perceptible delays in avatar movements. So, keyframe extraction and motion reconstruction techniques can be utilized to reduce the latency. Keyframe extraction techniques, commonly used in videos, are valuable as they summarize video content by identifying a concise set of representative frames. It can reduce the amount of motion data that needs to be transmitted to the users. There has been a lot of research on video-based keyframe extraction these years, which provides valuable insights and techniques that enhance video storage, transmission and summarization (Yan and Woźniak, 2022; Sujatha and Mudenagudi, 2011). Due to the different data formats, keyframe extraction in motion capture sequences takes a distinct approach compared to video content. Keyframe extraction in motion capture involves selecting a set of representative frames, which correspond to the positions of human skeletal points, to represent the overall motion sequence effectively. Keyframes represent the significant poses or positions captured during the motion capture process (Xia et al., 2022). Therefore, we can transmit the keyframes of motion data in situations of network fluctuations, effectively meeting the requirement for low latency in the Metaverse.

However, we cannot directly show the extracted keyframes on the user’s device after keyframe extraction and transmission, as the transitions between these keyframes may appear sudden or stiff. To enhance smooth video playback for users, advanced video technologies utilize frame reconstruction techniques after the frame reduction (Yu et al., 2020; Pollefeys et al., 2008). Similar research exists in the field of motion capture. Motion reconstruction in motion capture is to generate natural in-between motions in terms of a given start-end keyframe pair (Xia et al., 2019). It can be utilized to fill this gap by creating coherent and realistic in-between motions that smoothly connect the keyframes, enhancing the overall quality of the captured motion.

Thus, novel motion capture techniques are crucial in achieving realistic and synchronized avatars’ movements in the Metaverse. As virtual environments become increasingly immersive, it is essential to accurately capture and map human motion in the real world to the virtual world. Since human motion data is usually captured at high frequency, the huge amount of motion data increases transmission consumption, resulting in a delay in avatar movements. However, avatars’ movements in the Metaverse have to meet low latency requirements. Therefore, keyframe extraction and motion reconstruction in motion capture become vital challenges to meet the low latency requirement. With these two methods, we are able to transmit only the keyframes of motion data despite the fluctuating network, and then reconstruct the frames on the user side, achieving movement synchronization between avatars and humans.

According to the existing work, we find that current AI-based methods always rely on labeled data for training. However, the accuracy of the labeled keyframes is uncertain as the labeled keyframes are based only on human intuition rather than selecting the most suitable keyframes. Since DQL can solve this problem, opportunities exist for designing a new keyframe extraction algorithm on top of DQL without using labeled dataset for training.

In parallel, some motion reconstruction methods ignore the constancy of bone length, resulting in incoherent reconstructed movements. Also, some approaches need to identify the motion category before the motion can be reconstructed, which makes the motion reconstruction limited by the motion category and takes a long time to reconstruct. Moreover, velocity information of human bones is often disregarded in current motion reconstruction methods, causing the underutilization of information in non-keyframes. Inspired by the above-mentioned articles and algorithms, the article aims to investigate a new motion reconstruction method that fully utilizes the velocity information to improve the quality of reconstructed motion. We propose a new motion reconstruction algorithm considering the information hidden in the non-keyframes. The proposed reconstruction algorithm, primarily a mathematical model and DQL, serves to decide the keyframe extraction optimally. Finally, the model is evaluated via experiments using motion data in the mixed category. The major contributions of this paper can be summarized as follows:

  • We first develop a new motion reconstruction algorithm in a spherical coordinate system. The location and velocity data of each point are converted into a spherical coordinate system to keep the length of the bones constant. Then, a spherical interpolation approach is conducted to reconstruct the middle frames. The motion of the root point is also reconstructed by utilizing polynomial interpolation methods.

  • To minimize the mean reconstruction error, we formalize the keyframe extraction problem into an optimization problem according to the motion reconstruction method. A SIDQL framework for keyframe extraction and motion reconstruction is then proposed by using a special reward function developed based on mean error. The SIDQL framework can be trained by mixed-category motion sequences without labeled keyframes.

  • We then conduct comprehensive experiments to study the impact of different hyperparameters. The reconstruction performance is measured by comparison analysis with various baselines under various numbers of keyframes. The experiments give evidence that our scheme can significantly reduce the data volume with promising reconstruction accuracy.

2. RELATED WORK

Keyframe extraction and motion reconstruction are foundation techniques for various fields, such as animation, gaming, and medical research. To reduce data redundancy and improve transmission efficiency, researchers proposed various keyframe extraction models in video and motion capture scenarios. In addition, motion reconstruction approaches are widely applied to enhance the visual quality in video sequences or motion capture data. This section briefly summarizes some previous studies about different keyframe extraction models and motion reconstruction approaches.

2.1. Keyframe Extraction

Keyframe extraction plays a significant role in both video and motion capture domains. The extracted keyframes can reduce data redundancy, enhance transmission efficiency and provide meaningful representations for various applications, including video summarization, motion analysis and animation synthesis.

Keyframe extraction in videos refers to the process of selecting a subset of frames that represent the essential content or significant moments within the video sequence. Some researchers use clustering methods to extract keyframes in videos. For example, Guru et al. (Guru et al., 2018), and Jadon et al. (Jadon and Jasim, 2020) utilized Gaussian Mixture Model and K-means clustering method to cluster all frames based on features, where the cluster centers represented the keyframes. In addition to clustering algorithms, deep learning is also commonly used for video keyframe extraction. For example, Ly et al. (Ly et al., 2018) first utilized the Convolutional Neural Network (CNN) and Iterative Quantization (ITQ) method to calculate the hamming distance of each frame. If the hamming distance differed from the previous frame, it was considered as a keyframe. In (Savran Kızıltepe et al., 2021), CNN and Recurrent Neural Network (RNN) were used to identify the information regions of each video frame. The frames with higher structural similarity scores were considered as keyframes. Furthermore, keyframe extraction in videos has widespread applications in the field of medical screening. For instance, Huang et al. (Huang et al., 2022) introduced a reinforcement learning-based framework for extracting keyframes in breast ultrasound videos automatically, while Pu et al. (Pu et al., 2021) utilized CNN and RNN to select keyframes in fetal ultrasound videos.

With the growing demand for realistic and high-quality animations in various industries, keyframe extraction in motion capture attracts increasing attention, which is the process of identifying and extracting significant frames from a motion sequence. The curve simplification-based approach is an important kind of keyframe extraction method (Bulut and Capin, 2007; Zhang et al., 2014a). In this method, a single frame from a motion sequence can be viewed as a point within a high-dimensional space, and these points can collectively form a trajectory curve. The keyframes are then selected as the junctions between the simplified curve segments. Bulut et al. (Bulut and Capin, 2007) proposed a curve saliency metric that extracts keyframes from motion curves, where points with saliency values higher than the average saliency value were identified as keyframes. Zhang et al. (Zhang et al., 2014a) employed the Principal Component Analysis (PCA) method to derive feature curves from a motion sequence, and subsequently identified the keyframes by extracting local optimum points along the curves. Another alternative method for keyframe extraction is the clustering-based approach, where similar frames are grouped into clusters. From each cluster, a representative frame is selected as a keyframe for the motion sequence. For instance, Zhang et al. (Zhang et al., 2013) introduced the ISODATA method for clustering all frames in a motion sequence, with the frames closest to the center of each cluster being identified as keyframes. Sun et al. (Sun et al., 2018) proposed an affine propagation clustering algorithm to cluster all the frames, where the center of each cluster was chosen as the keyframe. Furthermore, some researchers apply matrix factorization-based methods to extract the keyframes. They utilize a motion matrix to represent a motion sequence, which is subsequently decomposed into a weight matrix and a keyframe matrix. Huang et al. (Huang et al., 2005) designed the Key Probe method to factorize the motion matrix into two smaller matrixes, then used the least-squares optimization method to solve the keyframe extraction problem.

In recent years, AI-based keyframe extraction methods have also attracted interest from researchers. Genetic Algorithm (GA) is an important heuristic algorithm for decision-making problems. In (Liu et al., 2013), Liu et al. combined GA with a probabilistic simplex method to propose a keyframe extraction algorithm called Simplex Hybrid Genetic Algorithm (SMHGA), which enhanced the data processing speed compared to GA. The keyframes were obtained through an iterative application of the algorithm, where initial populations were generated randomly and intelligently. Zhang et al. (Zhang et al., 2014b) presented a Multiple Population Genetic Algorithm (MPGA) to extract the keyframes of motion capture data by minimizing the reconstruction error. While the SMHGA utilized a single group for evolution, the MPGA employed multiple populations, resulting in a larger search space for the optimization process. Except for GA, Particle Swarm Optimization (PSO), as another prominent heuristic algorithm, can also be used for keyframe extraction. Chang et al. (Chang et al., 2016) designed a keyframe extraction method based on PSO. The iteration of PSO stopped when the variance was smaller than the set value, and then the output was considered as keyframes. However, the heuristic algorithms always take a long time to train. Some sparse represent-based methods (Wang et al., 2011; Elhamifar et al., 2012) can be used to solve the problem. Xia et al. (Xia et al., 2016) proposed a joint kernel sparse representation model to extract the keyframes, which could simultaneously model the sparseness and the Riemannian manifold structure of the human skeleton. Also, some researchers use deep learning methods to solve the keyframe extraction problem. For example, Prabakaran et al. (Prabakaran et al., 2022) first used PCA to reduce the data size, then proposed a new algorithm to extract the keyframes of motion data based on Optimized Convolution Neural Network (OCNN) and Intensity Feature Selection (IFS). A graph-based deep reinforcement learning method for keyframe extraction was proposed by Mo et al. (Mo et al., 2021), in which each frame is represented as a graph.

From the above work, it is envisioned that researchers have done a lot of work on keyframe extraction. As human motion data is captured frame-by-frame, we are much inspired by some video-based approaches. However, these approaches are limited to extracting keyframes from images, which cannot be directly applied to select the keyframes in sequential motion data. Since the bone length is fixed, researchers use mathematical methods to calculate the joints, but these methods are computationally intensive. With the development of AI technology, some researchers use AI-based methods to reduce the computational effort of models. Although these methods are fast and accurate, some of them need to identify the type of motion before training, which is not widely applied. In addition, some AI-based methods convert each frame of motion data to a graph, which also requires high computational cost. Furthermore, existing AI-based methods rely heavily on labeled data for training, but the selected keyframes are labeled according to human intuition, which contains much subjective noise and uncertainty, so the accuracy of the labels is doubtful. As there are no fixed categories of avatar motions in the Metaverse and it is difficult to label keyframes artificially, it is a very challenging task to extract keyframes in the Metaverse.

2.2. Motion Reconstruction

In addition to keyframe extraction algorithms, there is still plenty of work that mainly focuses on the reconstruction of videos and motion capture. Reconstruction techniques aim to enhance the visual quality, resolution, and fidelity of individual frames in a video sequence or motion capture data, which can be used in many different areas.

Video reconstruction refers to the process of generating a new video sequence from a given input, typically with the goal of improving or enhancing the original video. Deep learning methods have recently demonstrated the potential to solve the video reconstruction problem. Weng et al. (Weng et al., 2021) and Rebecq et al. (Rebecq et al., 2019a, b) proposed CNN-based models for video reconstruction, leveraging the inherent capability of CNNs to capture and learn intricate spatial and temporal patterns in video data. Wang et al. (Wang et al., 2019) reconstructed the frames from grayscale videos with Generative Adversarial Networks (GANs). In addition, video reconstruction has a wide range of applications, such as medical imaging (Després and Jia, 2017) and multimedia (Lin et al., 2018).

With the development of motion capture technologies, motion reconstruction has been a significant area of research, aiming to generate natural intermediate motions based on a given start-end keyframe pair. Various techniques and methodologies have been explored in this field. In the early stage, most of the motion reconstruction methods were designed with linear interpolation (Zhang et al., 2014a, 2013; Huang et al., 2005; Zhang et al., 2014b), providing a simple and straightforward way for motion reconstruction. However, linear interpolation may result in incoherent motions, especially when significant variations exist between consecutive frames. Polynomial interpolation, on the other hand, offers more flexibility by fitting curves or polynomials to the motion sequences (Howarth and Callaghan, 2010; Halici and Bostanci, 2019), enabling smoother and more accurate reconstruction of motion trajectories compared with linear interpolation. However, the bone length should be a constant value. Otherwise, the movements will appear incoherent and stiff. Both linear interpolation and polynomial interpolation change the length of the bones, resulting in lower accuracy of motion reconstruction. To solve this problem, the Spherical Linear Interpolation (Slerp) method has been explored to reconstruct the motion (Sun et al., 2018; Chang et al., 2016; Halit and Capin, 2011), ensuring that the bone length remains constant during the interpolation. The Slerp method interpolates between two quaternions along the shortest path on the unit sphere, which maintains a constant angular velocity during the interpolation process. Additionally, motion reconstruction based on deep learning has gained significant attention in recent years. Xia et al. (Xia et al., 2019) proposed a Sphere Nonlinear Interpolation (Snerp) model to reconstruct the motion based on paired dictionary learning and sphere interpolation. In contrast to Slerp, Snerp does not maintain a constant angular velocity during interpolation, resulting in motion that closely resembles human skeletal movement and thereby improving the accuracy of motion reconstruction. Kim et al. (Kim et al., 2021) utilized a DNN with an attention mechanism to reconstruct the motion, which could effectively observe the long-term information with the attention layers.

The work mentioned above has demonstrated that researchers have made significant advancements in motion reconstruction in recent years. Few works have been done with linear interpolation and polynomial interpolation, but the reconstructed motion appears to be incoherent and stiff due to the non-fixed length of the bones. While the Slerp method effectively addresses this issue, its drawback lies in maintaining a constant angular velocity during interpolation, leading to motion that does not closely resemble human skeletal movement. Also, deep learning-based methods always rely on labeled data for training, which cannot be applied to motion synchronization scenarios. Nevertheless, this kind of method always needs to identify the type of motion before training, which cannot be widely applied. However, some motion capture devices can directly support the calculation of velocity information nowadays, further increasing the amount of motion information. As most of the existing motion reconstruction methods do not involve velocity, the information in non-keyframes is not fully exploited. Here, we propose a SIDQL framework for keyframe extraction and motion reconstruction. The polynomial interpolation method is used to reconstruct the motion of the root point, while a new spherical interpolation-based method is utilized to generate the motion of other skeletal points. In addition, DQL is used to extract the keyframes without using labeled data.

3. PROBLEM FORMULATION

To solve the challenges of keyframe extraction and motion reconstruction in motion capture, a new math model of the whole procedure is developed. We first convert the sequences into spherical coordinates and propose a new motion reconstruction method utilizing the location and velocity information. The keyframe extraction problem is then formalized into an optimization problem, which aims to minimize the reconstruction error. For convenience, the major notations used in this paper are listed in Table 1.

TABLE 1. Major Notations
Notation Description
N The number of frames in a sequence.
M The number of points in a frame.
Fnsubscript𝐹𝑛F_{n}italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT The nthsubscript𝑛𝑡n_{th}italic_n start_POSTSUBSCRIPT italic_t italic_h end_POSTSUBSCRIPT frame in sequence S.
rmsubscript𝑟𝑚r_{m}italic_r start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT The length of the mthsuperscript𝑚𝑡m^{th}italic_m start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT bone.
θmsubscript𝜃𝑚\theta_{m}italic_θ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT The angle value between the vector of the mthsuperscript𝑚𝑡m^{th}italic_m start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT bone and the z-axis.
ϕmsubscriptitalic-ϕ𝑚\phi_{m}italic_ϕ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT The angle value between the vector of the mthsuperscript𝑚𝑡m^{th}italic_m start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT bone and the y-axis.
(θn,m,ϕn,m)subscript𝜃𝑛𝑚subscriptitalic-ϕ𝑛𝑚(\theta_{n,m},\phi_{n,m})( italic_θ start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ) The mthsuperscript𝑚𝑡m^{th}italic_m start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT point in the nthsubscript𝑛𝑡n_{th}italic_n start_POSTSUBSCRIPT italic_t italic_h end_POSTSUBSCRIPT frame.
(θ˙n,m,ϕ˙n,m)subscript˙𝜃𝑛𝑚subscript˙italic-ϕ𝑛𝑚(\dot{\theta}_{n,m},\dot{\phi}_{n,m})( over˙ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT , over˙ start_ARG italic_ϕ end_ARG start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ) The velocity of mthsuperscript𝑚𝑡m^{th}italic_m start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT point in the nthsubscript𝑛𝑡n_{th}italic_n start_POSTSUBSCRIPT italic_t italic_h end_POSTSUBSCRIPT frame.
(θ~n,m,ϕ~n,m)subscript~𝜃𝑛𝑚subscript~italic-ϕ𝑛𝑚(\tilde{\theta}_{n,m},\tilde{\phi}_{n,m})( over~ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT , over~ start_ARG italic_ϕ end_ARG start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ) The mthsuperscript𝑚𝑡m^{th}italic_m start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT point in the nthsubscript𝑛𝑡n_{th}italic_n start_POSTSUBSCRIPT italic_t italic_h end_POSTSUBSCRIPT reconstructed frame.
ΔΔ\Deltaroman_Δ The time interval between two adjacent frames.
K The keyframe extraction decision of sequence S.
kwsubscript𝑘𝑤k_{w}italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT The index of the wthsuperscript𝑤𝑡w^{th}italic_w start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT keyframe.
E The mean error after reconstruction.

3.1. Motion Reconstruction

Motion reconstruction of motion capture is a problem that uses two frames in a sequence to reconstruct the middle frames via different approaches. By using reconstruction techniques, we increase the frame rate of the motion, improve the consistency of movement and reduce the size of transmitted data. Since the bone length is fixed, the traditional interpolation algorithm in the Cartesian coordinate system fails to ensure a constant length, which will greatly influence the watching experience. In addition, linear interpolation approaches will inevitably influence the smoothness between different sections. As a consequence, we proposed a new motion reconstruction algorithm with polynomial interpolation in the spherical coordinate system. The framework of motion reconstruction is shown in Fig 1. ASF files contain information about the skeletal structure of an actor and the joints, while AMC files describe the actual movements of the joints in an actor’s skeletal structure. AMC files are used in conjunction with ASF files to create a complete representation of an actor’s movements, which we use for research in this paper.

Refer to caption

Figure 1. The procedure of reconstruction.

For any point (x,y,z)𝑥𝑦𝑧(x,y,z)( italic_x , italic_y , italic_z ) in the Cartesian coordinate system, it can be represented in the spherical coordinate system using (r,θ,ϕ)𝑟𝜃italic-ϕ(r,\theta,\phi)( italic_r , italic_θ , italic_ϕ ). The equation for converting the Cartesian coordinate system to the spherical coordinate system is given as:

(1) {r=x2+y2+z2θ=arccos(zr)ϕ=arctan(yx)cases𝑟superscript𝑥2superscript𝑦2superscript𝑧2otherwise𝜃𝑧𝑟otherwiseitalic-ϕ𝑦𝑥otherwise\displaystyle\begin{cases}r=\sqrt{x^{2}+y^{2}+z^{2}}\\ \theta=\arccos{(\frac{z}{r})}\\ \phi=\arctan{(\frac{y}{x})}\end{cases}{ start_ROW start_CELL italic_r = square-root start_ARG italic_x start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_y start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_z start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL italic_θ = roman_arccos ( divide start_ARG italic_z end_ARG start_ARG italic_r end_ARG ) end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL italic_ϕ = roman_arctan ( divide start_ARG italic_y end_ARG start_ARG italic_x end_ARG ) end_CELL start_CELL end_CELL end_ROW

We then assume that vcsubscript𝑣𝑐\vec{v}_{c}over→ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT and vssubscript𝑣𝑠\vec{v}_{s}over→ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT is the velocity of the point in the Cartesian coordinate system and in the spherical coordinate system. According to the operation of the vector, vcsubscript𝑣𝑐\vec{v}_{c}over→ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT and vssubscript𝑣𝑠\vec{v}_{s}over→ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT can be represented as:

(2) vc=x˙+y˙+z˙subscript𝑣𝑐˙𝑥˙𝑦˙𝑧\vec{v}_{c}=\dot{x}+\dot{y}+\dot{z}over→ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = over˙ start_ARG italic_x end_ARG + over˙ start_ARG italic_y end_ARG + over˙ start_ARG italic_z end_ARG
(3) vs=r˙+θ˙+ϕ˙subscript𝑣𝑠˙𝑟˙𝜃˙italic-ϕ\vec{v}_{s}=\dot{r}+\dot{\theta}+\dot{\phi}over→ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = over˙ start_ARG italic_r end_ARG + over˙ start_ARG italic_θ end_ARG + over˙ start_ARG italic_ϕ end_ARG

Since in the Cartesian coordinate system, the vector can be decomposed according to different orthogonal unit vectors. vcsubscript𝑣𝑐\vec{v}_{c}over→ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT can also be decomposed according to the direction of (r,θ,ϕ)𝑟𝜃italic-ϕ(r,\theta,\phi)( italic_r , italic_θ , italic_ϕ ), which is represented as (r^,θ^,ϕ^)^𝑟^𝜃^italic-ϕ(\widehat{r},\widehat{\theta},\widehat{\phi})( over^ start_ARG italic_r end_ARG , over^ start_ARG italic_θ end_ARG , over^ start_ARG italic_ϕ end_ARG ). Taking the relationship between angular velocity and linear velocity into consideration, we derive the equations of the velocity:

(4) vc=r˙r^+rθ˙θ^+rsinθϕ˙ϕ^subscript𝑣𝑐˙𝑟^𝑟𝑟˙𝜃^𝜃𝑟𝜃˙italic-ϕ^italic-ϕ\vec{v_{c}}=\dot{r}\widehat{r}+r\dot{\theta}\widehat{\theta}+r\sin{\theta}\dot% {\phi}\widehat{\phi}over→ start_ARG italic_v start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_ARG = over˙ start_ARG italic_r end_ARG over^ start_ARG italic_r end_ARG + italic_r over˙ start_ARG italic_θ end_ARG over^ start_ARG italic_θ end_ARG + italic_r roman_sin italic_θ over˙ start_ARG italic_ϕ end_ARG over^ start_ARG italic_ϕ end_ARG

Combined (1) (2)with (4), we can obtain the first derivative of r, θ𝜃\thetaitalic_θ and ϕitalic-ϕ\phiitalic_ϕ as:

(5) r˙=sinθcosϕx˙+sinθsinϕy˙+cosθz˙˙𝑟𝜃italic-ϕ˙𝑥𝜃italic-ϕ˙𝑦𝜃˙𝑧\dot{r}=\sin{\theta}\cos{\phi}\dot{x}+\sin{\theta}\sin{\phi}\dot{y}+\cos{% \theta}\dot{z}\quadover˙ start_ARG italic_r end_ARG = roman_sin italic_θ roman_cos italic_ϕ over˙ start_ARG italic_x end_ARG + roman_sin italic_θ roman_sin italic_ϕ over˙ start_ARG italic_y end_ARG + roman_cos italic_θ over˙ start_ARG italic_z end_ARG
(6) θ˙=cosθcosϕrx˙+cosθsinϕry˙sinθrz˙˙𝜃𝜃italic-ϕ𝑟˙𝑥𝜃italic-ϕ𝑟˙𝑦𝜃𝑟˙𝑧\dot{\theta}=\frac{\cos{\theta}\cos{\phi}}{r}\dot{x}+\frac{\cos{\theta}\sin{% \phi}}{r}\dot{y}-\frac{\sin{\theta}}{r}\dot{z}over˙ start_ARG italic_θ end_ARG = divide start_ARG roman_cos italic_θ roman_cos italic_ϕ end_ARG start_ARG italic_r end_ARG over˙ start_ARG italic_x end_ARG + divide start_ARG roman_cos italic_θ roman_sin italic_ϕ end_ARG start_ARG italic_r end_ARG over˙ start_ARG italic_y end_ARG - divide start_ARG roman_sin italic_θ end_ARG start_ARG italic_r end_ARG over˙ start_ARG italic_z end_ARG
(7) ϕ˙=sinϕrsinθx˙+cosϕrsinθy˙˙italic-ϕitalic-ϕ𝑟𝜃˙𝑥italic-ϕ𝑟𝜃˙𝑦\dot{\phi}=-\frac{\sin{\phi}}{r\sin{\theta}}\dot{x}+\frac{\cos{\phi}}{r\sin{% \theta}}\dot{y}\qquad\qquad\qquad\quadover˙ start_ARG italic_ϕ end_ARG = - divide start_ARG roman_sin italic_ϕ end_ARG start_ARG italic_r roman_sin italic_θ end_ARG over˙ start_ARG italic_x end_ARG + divide start_ARG roman_cos italic_ϕ end_ARG start_ARG italic_r roman_sin italic_θ end_ARG over˙ start_ARG italic_y end_ARG

As the bone length is fixed for a human, r is a constant that represents the bone length. The spherical coordinate system with a fixed r can also be called spherical polar coordinates. From r˙˙𝑟\dot{r}over˙ start_ARG italic_r end_ARG=0, The equations 6 and 7 can be simplified as:

(8) {θ˙=z˙rsinθϕ˙=y˙sinθ+z˙cosθsinϕrsin2θcosϕcases˙𝜃˙𝑧𝑟𝜃otherwise˙italic-ϕ˙𝑦𝜃˙𝑧𝜃italic-ϕ𝑟superscript2𝜃italic-ϕotherwise\displaystyle\begin{cases}\dot{\theta}=-\frac{\dot{z}}{r\sin{\theta}}\\ \dot{\phi}=\frac{\dot{y}\sin{\theta}+\dot{z}\cos{\theta}\sin{\phi}}{r\sin^{2}{% \theta}\cos{\phi}}\end{cases}{ start_ROW start_CELL over˙ start_ARG italic_θ end_ARG = - divide start_ARG over˙ start_ARG italic_z end_ARG end_ARG start_ARG italic_r roman_sin italic_θ end_ARG end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL over˙ start_ARG italic_ϕ end_ARG = divide start_ARG over˙ start_ARG italic_y end_ARG roman_sin italic_θ + over˙ start_ARG italic_z end_ARG roman_cos italic_θ roman_sin italic_ϕ end_ARG start_ARG italic_r roman_sin start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_θ roman_cos italic_ϕ end_ARG end_CELL start_CELL end_CELL end_ROW

By using the coordinate system transformation equations 8, we can convert all the points in a sequence from the Cartesian coordinate system to the spherical coordinate system. We denote the sequence as S={Fn|1nN}𝑆conditional-setsubscript𝐹𝑛1𝑛𝑁S=\{F_{n}\big{|}1\leq n\leq N\}italic_S = { italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT | 1 ≤ italic_n ≤ italic_N }, where Fnsubscript𝐹𝑛F_{n}italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is the nthsuperscript𝑛𝑡n^{th}italic_n start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT frame in the sequence, and N𝑁Nitalic_N is the total number of the frames. Subsequently, for each frame, Fnsubscript𝐹𝑛F_{n}italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, each point of the frame can be represented by θ𝜃\thetaitalic_θ and ϕitalic-ϕ\phiitalic_ϕ. We can further denote Fn={(θn,m,ϕn,m)|1mM}subscript𝐹𝑛conditional-setsubscript𝜃𝑛𝑚subscriptitalic-ϕ𝑛𝑚1𝑚𝑀F_{n}=\{(\theta_{n,m},\phi_{n,m})\big{|}1\leq m\leq M\}italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = { ( italic_θ start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ) | 1 ≤ italic_m ≤ italic_M }, where (θn,m,ϕn,m)subscript𝜃𝑛𝑚subscriptitalic-ϕ𝑛𝑚(\theta_{n,m},\phi_{n,m})( italic_θ start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ) is the mthsuperscript𝑚𝑡m^{th}italic_m start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT point in the frame and M𝑀Mitalic_M is the the amount of points in each frame. Moreover, the position of the points is also related to the length of the bones, which can be denoted as R={rm|1mM}𝑅conditional-setsubscript𝑟𝑚1𝑚𝑀R=\{r_{m}\big{|}1\leq m\leq M\}italic_R = { italic_r start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT | 1 ≤ italic_m ≤ italic_M }. Since R𝑅Ritalic_R represents the length of the bones, it is fixed in a given sequence. So, R𝑅Ritalic_R is not included during reconstruction.

For two frames F0subscript𝐹0F_{0}italic_F start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and FKsubscript𝐹𝐾F_{K}italic_F start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT, we assume that the overall movement follows polynomial functions:

(9) {θt=λ=0λmaxAλtλϕt=λ=0λmaxBλtλ\displaystyle\left\{\begin{aligned} \theta_{t}=\sum_{\lambda=0}^{\lambda^{max}% }A_{\lambda}t^{\lambda}\\ \phi_{t}=\sum_{\lambda=0}^{\lambda^{max}}B_{\lambda}t^{\lambda}\end{aligned}\right.{ start_ROW start_CELL italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_λ = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_λ start_POSTSUPERSCRIPT italic_m italic_a italic_x end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT italic_t start_POSTSUPERSCRIPT italic_λ end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_λ = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_λ start_POSTSUPERSCRIPT italic_m italic_a italic_x end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_B start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT italic_t start_POSTSUPERSCRIPT italic_λ end_POSTSUPERSCRIPT end_CELL end_ROW

We set the frame F0subscript𝐹0F_{0}italic_F start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT as the frame in time t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and FKsubscript𝐹𝐾F_{K}italic_F start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT as the frame in time tKsubscript𝑡𝐾t_{K}italic_t start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT. For the mthsuperscript𝑚𝑡m^{th}italic_m start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT point we now have the information (θ0,m,ϕ0,m)subscript𝜃0𝑚subscriptitalic-ϕ0𝑚(\theta_{0,m},\phi_{0,m})( italic_θ start_POSTSUBSCRIPT 0 , italic_m end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT 0 , italic_m end_POSTSUBSCRIPT ) and (θK,m,ϕK,m)subscript𝜃𝐾𝑚subscriptitalic-ϕ𝐾𝑚(\theta_{K,m},\phi_{K,m})( italic_θ start_POSTSUBSCRIPT italic_K , italic_m end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT italic_K , italic_m end_POSTSUBSCRIPT ). Also, we get the velocity information (θ˙0,m,ϕ˙0,m)subscript˙𝜃0𝑚subscript˙italic-ϕ0𝑚(\dot{\theta}_{0,m},\dot{\phi}_{0,m})( over˙ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT 0 , italic_m end_POSTSUBSCRIPT , over˙ start_ARG italic_ϕ end_ARG start_POSTSUBSCRIPT 0 , italic_m end_POSTSUBSCRIPT ) and (θ˙K,m,ϕ˙K,m)subscript˙𝜃𝐾𝑚subscript˙italic-ϕ𝐾𝑚(\dot{\theta}_{K,m},\dot{\phi}_{K,m})( over˙ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_K , italic_m end_POSTSUBSCRIPT , over˙ start_ARG italic_ϕ end_ARG start_POSTSUBSCRIPT italic_K , italic_m end_POSTSUBSCRIPT ). Then, four equations can be obtained by inputting the values into the function for each parameter. So we set the highest power λmax=3superscript𝜆𝑚𝑎𝑥3\lambda^{max}=3italic_λ start_POSTSUPERSCRIPT italic_m italic_a italic_x end_POSTSUPERSCRIPT = 3, and the function can be solved using the location and velocity information of F0subscript𝐹0F_{0}italic_F start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and FKsubscript𝐹𝐾F_{K}italic_F start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT, which can be denoted as:

(10) {Θm(t)=λ=03Am,λtλΦm(t)=λ=03Bm,λtλ\displaystyle\left\{\begin{aligned} \Theta_{m}(t)=\sum_{\lambda=0}^{3}A_{m,% \lambda}t^{\lambda}\\ \Phi_{m}(t)=\sum_{\lambda=0}^{3}B_{m,\lambda}t^{\lambda}\end{aligned}\right.{ start_ROW start_CELL roman_Θ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_t ) = ∑ start_POSTSUBSCRIPT italic_λ = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_m , italic_λ end_POSTSUBSCRIPT italic_t start_POSTSUPERSCRIPT italic_λ end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL roman_Φ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_t ) = ∑ start_POSTSUBSCRIPT italic_λ = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_B start_POSTSUBSCRIPT italic_m , italic_λ end_POSTSUBSCRIPT italic_t start_POSTSUPERSCRIPT italic_λ end_POSTSUPERSCRIPT end_CELL end_ROW

As the time interval between two adjacent frames should be the same in a sequence, we denote the time interval as ΔΔ\Deltaroman_Δ. The mthsuperscript𝑚𝑡m^{th}italic_m start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT of the kthsuperscript𝑘𝑡k^{th}italic_k start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT frame in the reconstructed sequence can be derived by:

(11) {θ~k,m=Θm(t0+kΔ)ϕ~k,m=Φm(t0+kΔ)\displaystyle\left\{\begin{aligned} \tilde{\theta}_{k,m}=\Theta_{m}(t_{0}+k% \Delta)\\ \tilde{\phi}_{k,m}=\Phi_{m}(t_{0}+k\Delta)\end{aligned}\right.{ start_ROW start_CELL over~ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k , italic_m end_POSTSUBSCRIPT = roman_Θ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_k roman_Δ ) end_CELL end_ROW start_ROW start_CELL over~ start_ARG italic_ϕ end_ARG start_POSTSUBSCRIPT italic_k , italic_m end_POSTSUBSCRIPT = roman_Φ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_k roman_Δ ) end_CELL end_ROW

After reconstruct all the sections in the sequence, we can get the new sequence S~={F~n|1nN}~𝑆conditional-setsubscript~𝐹𝑛1𝑛𝑁\tilde{S}=\{\tilde{F}_{n}\big{|}1\leq n\leq N\}over~ start_ARG italic_S end_ARG = { over~ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT | 1 ≤ italic_n ≤ italic_N }, where F~n={(θ~n,m,ϕ~n,m)|1mM}subscript~𝐹𝑛conditional-setsubscript~𝜃𝑛𝑚subscript~italic-ϕ𝑛𝑚1𝑚𝑀\tilde{F}_{n}=\{(\tilde{\theta}_{n,m},\tilde{\phi}_{n,m})\big{|}1\leq m\leq M\}over~ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = { ( over~ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT , over~ start_ARG italic_ϕ end_ARG start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ) | 1 ≤ italic_m ≤ italic_M }.

To determine the location of each point using the spherical coordinate system, we need to know the coordinates of the root point in the Cartesian coordinate system. Since the root point does not have an origin point, it can not be determined using the spherical coordinate system. This also means that the root point will not involve changes in the bone length during reconstruction, which makes the interpolation algorithm in the Cartesian coordinate system a good choice. Also, compared with all the other points, the root point is the most stable point. As a result, the trajectory of the root point can be well reconstructed with polynomial interpolation in the Cartesian coordinate system.

Similarly, we first denote the coordinate of the root point in frame F0subscript𝐹0F_{0}italic_F start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and FKsubscript𝐹𝐾F_{K}italic_F start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT as O0subscript𝑂0O_{0}italic_O start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and OKsubscript𝑂𝐾O_{K}italic_O start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT. We also have the velocity of the root point, which is shown as O˙0subscript˙𝑂0\dot{O}_{0}over˙ start_ARG italic_O end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and O˙Ksubscript˙𝑂𝐾\dot{O}_{K}over˙ start_ARG italic_O end_ARG start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT. By substituting the four values, we can calculate the parameter of another polynomial equation:

(12) Γ(t)=λ=03CλtλΓ𝑡superscriptsubscript𝜆03subscript𝐶𝜆superscript𝑡𝜆\Gamma(t)=\sum_{\lambda=0}^{3}C_{\lambda}t^{\lambda}roman_Γ ( italic_t ) = ∑ start_POSTSUBSCRIPT italic_λ = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT italic_t start_POSTSUPERSCRIPT italic_λ end_POSTSUPERSCRIPT

from which the position of the root point of the kthsuperscript𝑘𝑡k^{th}italic_k start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT frame can be obtained via:

(13) Ok=Γ(t0+kΔ)subscript𝑂𝑘Γsubscript𝑡0𝑘ΔO_{k}=\Gamma(t_{0}+k\Delta)italic_O start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = roman_Γ ( italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_k roman_Δ )

3.2. Keyframe Extraction

Keyframe extraction of motion capture is one of the techniques used to determine the most critical frames in a motion sequence. The extracted keyframes can be used to achieve various goals, e.g., summarize the sequence for the users using several frames and enable efficient storage. More importantly, by combining keyframe extraction and reconstruction methods, we can significantly decrease the bandwidth requirement, which can be used to handle the network congestion and decrease the latency.

Most existing work chooses the labeled keyframes of the ground truth by humans. However, such a kind of ground truth keyframes is not the optimal decision for reconstruction. To achieve near-optimal reconstruction performance, we want to find an intelligent way to extract the proper keyframes for the proposed reconstruction algorithm. The keyframes extraction problem considering the reconstruction process will be elaborated in this section.

For each given sequence S={Fn|1nN}𝑆conditional-setsubscript𝐹𝑛1𝑛𝑁S=\{F_{n}\big{|}1\leq n\leq N\}italic_S = { italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT | 1 ≤ italic_n ≤ italic_N }, we denote the keyframe extraction decision as K={kw|1wW}𝐾conditional-setsubscript𝑘𝑤1𝑤𝑊K=\{k_{w}\big{|}1\leq w\leq W\}italic_K = { italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT | 1 ≤ italic_w ≤ italic_W }, where kwsubscript𝑘𝑤k_{w}italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT is the index of the wthsuperscript𝑤𝑡w^{th}italic_w start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT keyframe and W𝑊Witalic_W is the total number of extracted keyframes. For the section between the wthsuperscript𝑤𝑡w^{th}italic_w start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT keyframe and the (w+1)thsuperscript𝑤1𝑡(w+1)^{th}( italic_w + 1 ) start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT keyframe, we apply the reconstruction algorithm and calculate out the polynomial equation Θmw(t)superscriptsubscriptΘ𝑚𝑤𝑡\Theta_{m}^{w}(t)roman_Θ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t ) and Φmw(t)superscriptsubscriptΦ𝑚𝑤𝑡\Phi_{m}^{w}(t)roman_Φ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t ) for the mthsuperscript𝑚𝑡m^{th}italic_m start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT point.

For all the points, we can use the same process and derive the trajectory equations Θw(t)={Θ1w(t),Θ2w(t),,ΘMw(t)}superscriptΘ𝑤𝑡superscriptsubscriptΘ1𝑤𝑡superscriptsubscriptΘ2𝑤𝑡superscriptsubscriptΘ𝑀𝑤𝑡\Theta^{w}(t)=\{\Theta_{1}^{w}(t),\Theta_{2}^{w}(t),\cdots,\Theta_{M}^{w}(t)\}roman_Θ start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t ) = { roman_Θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t ) , roman_Θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t ) , ⋯ , roman_Θ start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t ) } and Φw(t)={Φ1w(t),Φ2w(t),,ΦMw(t)}superscriptΦ𝑤𝑡superscriptsubscriptΦ1𝑤𝑡superscriptsubscriptΦ2𝑤𝑡superscriptsubscriptΦ𝑀𝑤𝑡\Phi^{w}(t)=\{\Phi_{1}^{w}(t),\Phi_{2}^{w}(t),\cdots,\Phi_{M}^{w}(t)\}roman_Φ start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t ) = { roman_Φ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t ) , roman_Φ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t ) , ⋯ , roman_Φ start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t ) } for the wthsuperscript𝑤𝑡w^{th}italic_w start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT section. Then for any point in the qthsuperscript𝑞𝑡q^{th}italic_q start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT reconstructed frame, the position can be obtained by:

(14) {θ~kw+q,m=Θmw(tkw+qΔ)ϕ~kw+q,m=Φmw(tkw+qΔ)\displaystyle\left\{\begin{aligned} \tilde{\theta}_{k_{w}+q,m}=\Theta_{m}^{w}(% t_{k_{w}}+q\Delta)\\ \tilde{\phi}_{k_{w}+q,m}=\Phi_{m}^{w}(t_{k_{w}}+q\Delta)\end{aligned}\right.{ start_ROW start_CELL over~ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT + italic_q , italic_m end_POSTSUBSCRIPT = roman_Θ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_q roman_Δ ) end_CELL end_ROW start_ROW start_CELL over~ start_ARG italic_ϕ end_ARG start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT + italic_q , italic_m end_POSTSUBSCRIPT = roman_Φ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_q roman_Δ ) end_CELL end_ROW

where tkwsubscript𝑡subscript𝑘𝑤t_{k_{w}}italic_t start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_POSTSUBSCRIPT represents the timestamp of the wthsuperscript𝑤𝑡w^{th}italic_w start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT keyframe. Therefore the total error between the original frames and the new frames in this section for θ𝜃\thetaitalic_θ and ϕitalic-ϕ\phiitalic_ϕ can be calculated by:

(15) {Eθw=q=0Δkm=1M|Θmw(tkw+qΔ)θkw+q,m|Eϕw=q=0Δkm=1M|Φmw(tkw+qΔ)ϕkw+q,m|\left\{\begin{aligned} E_{\theta}^{w}=\sum_{q=0}^{\Delta k}\sum_{m=1}^{M}\big{% |}\Theta_{m}^{w}(t_{k_{w}}+q\Delta)-\theta_{k_{w}+q,m}\big{|}\\ E_{\phi}^{w}=\sum_{q=0}^{\Delta k}\sum_{m=1}^{M}\big{|}\Phi_{m}^{w}(t_{k_{w}}+% q\Delta)-\phi_{k_{w}+q,m}\big{|}\end{aligned}\right.{ start_ROW start_CELL italic_E start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_q = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_Δ italic_k end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT | roman_Θ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_q roman_Δ ) - italic_θ start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT + italic_q , italic_m end_POSTSUBSCRIPT | end_CELL end_ROW start_ROW start_CELL italic_E start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_q = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_Δ italic_k end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT | roman_Φ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_q roman_Δ ) - italic_ϕ start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT + italic_q , italic_m end_POSTSUBSCRIPT | end_CELL end_ROW

where Eθwsuperscriptsubscript𝐸𝜃𝑤E_{\theta}^{w}italic_E start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT is the total error of θ𝜃\thetaitalic_θ, Eϕwsuperscriptsubscript𝐸italic-ϕ𝑤E_{\phi}^{w}italic_E start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT is the total error of ϕitalic-ϕ\phiitalic_ϕ and ΔkΔ𝑘\Delta kroman_Δ italic_k is the frames number in the section. So, the total error of section w𝑤witalic_w can be given as:

(16) Ew=Eθw+Eϕwsuperscript𝐸𝑤superscriptsubscript𝐸𝜃𝑤superscriptsubscript𝐸italic-ϕ𝑤E^{w}=E_{\theta}^{w}+E_{\phi}^{w}italic_E start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT = italic_E start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT + italic_E start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT

We repeat the same reconstruction process for all the sections decided by the keyframe extraction decision K𝐾Kitalic_K. The overall trajectory equations are derived, which can be written as Θ(t)={Θ1(t),Θ2(t),,ΘW1(t)}Θ𝑡superscriptΘ1𝑡superscriptΘ2𝑡superscriptΘ𝑊1𝑡\Theta(t)=\{\Theta^{1}(t),\Theta^{2}(t),\cdots,\Theta^{W-1}(t)\}roman_Θ ( italic_t ) = { roman_Θ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ( italic_t ) , roman_Θ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_t ) , ⋯ , roman_Θ start_POSTSUPERSCRIPT italic_W - 1 end_POSTSUPERSCRIPT ( italic_t ) } and Φ(t)={Φ1(t),Φ2(t),,ΦW1(t)}Φ𝑡superscriptΦ1𝑡superscriptΦ2𝑡superscriptΦ𝑊1𝑡\Phi(t)=\{\Phi^{1}(t),\Phi^{2}(t),\cdots,\Phi^{W-1}(t)\}roman_Φ ( italic_t ) = { roman_Φ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ( italic_t ) , roman_Φ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_t ) , ⋯ , roman_Φ start_POSTSUPERSCRIPT italic_W - 1 end_POSTSUPERSCRIPT ( italic_t ) }. After that, we combined each section into the constructed new sequence S~={F~n|1nN}~𝑆conditional-setsubscript~𝐹𝑛1𝑛𝑁\tilde{S}=\{\tilde{F}_{n}\big{|}1\leq n\leq N\}over~ start_ARG italic_S end_ARG = { over~ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT | 1 ≤ italic_n ≤ italic_N }, where F~n={(θ~n,m,ϕ~n,m)|1mM}subscript~𝐹𝑛conditional-setsubscript~𝜃𝑛𝑚subscript~italic-ϕ𝑛𝑚1𝑚𝑀\tilde{F}_{n}=\{(\tilde{\theta}_{n,m},\tilde{\phi}_{n,m})\big{|}1\leq m\leq M\}over~ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = { ( over~ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT , over~ start_ARG italic_ϕ end_ARG start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ) | 1 ≤ italic_m ≤ italic_M }.

Thereafter, the mean error between S𝑆Sitalic_S and S~~𝑆\tilde{S}over~ start_ARG italic_S end_ARG is denoted as E𝐸Eitalic_E. Since the keyframes will be the same during reconstruction, the total error can be derived by summing the error of each section. Then we have the function:

(17) E=1N1Mw=1W1Ew=1N1Mw=1W1(Eθw+Eϕw)𝐸1𝑁1𝑀superscriptsubscript𝑤1𝑊1superscript𝐸𝑤1𝑁1𝑀superscriptsubscript𝑤1𝑊1superscriptsubscript𝐸𝜃𝑤superscriptsubscript𝐸italic-ϕ𝑤E=\frac{1}{N}\frac{1}{M}\sum_{w=1}^{W-1}E^{w}=\frac{1}{N}\frac{1}{M}\sum_{w=1}% ^{W-1}(E_{\theta}^{w}+E_{\phi}^{w})italic_E = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG divide start_ARG 1 end_ARG start_ARG italic_M end_ARG ∑ start_POSTSUBSCRIPT italic_w = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W - 1 end_POSTSUPERSCRIPT italic_E start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG divide start_ARG 1 end_ARG start_ARG italic_M end_ARG ∑ start_POSTSUBSCRIPT italic_w = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W - 1 end_POSTSUPERSCRIPT ( italic_E start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT + italic_E start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT )

Since that, the points are generated by Θ(t)Θ𝑡\Theta(t)roman_Θ ( italic_t ) and Φw(t)superscriptΦ𝑤𝑡\Phi^{w}(t)roman_Φ start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t ), which can be determined based on sequence S𝑆Sitalic_S and the keyframe extraction decision K𝐾Kitalic_K. Other parameters, such as point number M𝑀Mitalic_M, frames number N𝑁Nitalic_N, keyframe number W𝑊Witalic_W, and the time interval between frames ΔΔ\Deltaroman_Δ are fixed hyperparamter, which are given by the application environment. Resulting from that, the mean error can be determined by S𝑆Sitalic_S and K𝐾Kitalic_K. We further defined the error function Q𝑄Qitalic_Q, which can be formulated as:

Q(S,K)=𝑄𝑆𝐾absent\displaystyle Q(S,K)=italic_Q ( italic_S , italic_K ) = 1N1Mw=1W1q=0Δkm=1M(|Θmw(tkw+qΔ)\displaystyle\frac{1}{N}\frac{1}{M}\sum_{w=1}^{W-1}\sum_{q=0}^{\Delta k}\sum_{% m=1}^{M}(\big{|}\Theta_{m}^{w}(t_{k_{w}}+q\Delta)-divide start_ARG 1 end_ARG start_ARG italic_N end_ARG divide start_ARG 1 end_ARG start_ARG italic_M end_ARG ∑ start_POSTSUBSCRIPT italic_w = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_q = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_Δ italic_k end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT ( | roman_Θ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_q roman_Δ ) -
(18) θkw+q,m|+|Φmw(tkw+qΔ)ϕkw+q,m|)\displaystyle\theta_{k_{w}+q,m}\big{|}+\big{|}\Phi_{m}^{w}(t_{k_{w}}+q\Delta)-% \phi_{k_{w}+q,m}\big{|})italic_θ start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT + italic_q , italic_m end_POSTSUBSCRIPT | + | roman_Φ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ( italic_t start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_q roman_Δ ) - italic_ϕ start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT + italic_q , italic_m end_POSTSUBSCRIPT | )

From the error function Q(S,K)𝑄𝑆𝐾Q(S,K)italic_Q ( italic_S , italic_K ), we can see that for the given sequence S𝑆Sitalic_S, the reconstruction error can be calculated by the keyframe extraction decision K𝐾Kitalic_K. Therefore, our target for finding the best keyframes for the proposed reconstruction algorithm can be transformed into an optimization problem (𝒫𝒫\mathcal{P}caligraphic_P):

(19a) (𝒫)min:Q(S,K):𝒫𝑚𝑖𝑛𝑄𝑆𝐾\quad(\mathcal{P})\quad min:Q(S,K)\qquad\qquad\qquad\qquad( caligraphic_P ) italic_m italic_i italic_n : italic_Q ( italic_S , italic_K )
(19b) s.t.:kw,1wWs.t.:k_{w}\in\mathbb{Z},1\leq w\leq W\quaditalic_s . italic_t . : italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ∈ blackboard_Z , 1 ≤ italic_w ≤ italic_W
(19c) 1<kw<N,1wWformulae-sequence1subscript𝑘𝑤𝑁1𝑤𝑊\quad\qquad 1<k_{w}<N,1\leq w\leq W1 < italic_k start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT < italic_N , 1 ≤ italic_w ≤ italic_W
(19d) kikj,1i,jW,ijformulae-sequencesubscript𝑘𝑖subscript𝑘𝑗formulae-sequence1𝑖formulae-sequence𝑗𝑊𝑖𝑗\qquad\qquad k_{i}\neq k_{j},1\leq i,j\leq W,i\neq jitalic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≠ italic_k start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , 1 ≤ italic_i , italic_j ≤ italic_W , italic_i ≠ italic_j

The problem (𝒫𝒫\mathcal{P}caligraphic_P) is an optimization problem with a very large search space and is hard to solve by traditional optimization algorithms. To solve this problem with efficiency, we proposed a new machine learning algorithm using DQN. The next section elaborates on the details.

4. SIDQL ALGORITHM

Refer to caption

Figure 2. The procedure of SIDQL.

This section elaborates on our algorithm for solving the problem (𝒫𝒫\mathcal{P}caligraphic_P). The sequences are transformed to the spherical coordinate system, and a DQN is used to extract the keyframes. A special reward function is also proposed to further ensure the accuracy of the proposed reconstruction approach. Fig 2 shows the framework of the whole process.

4.1. Framework

For any sequence S𝑆Sitalic_S, we first convert the location data (x,y,z)𝑥𝑦𝑧(x,y,z)( italic_x , italic_y , italic_z ) and velocity data vcsubscript𝑣𝑐\vec{v}_{c}over→ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT into the spherical coordinate system using 1 and 8. Since that r𝑟ritalic_r is constant for the sequence, the input data of the nthsuperscript𝑛𝑡n^{th}italic_n start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT frame can be represented by (Fn,Vn)subscript𝐹𝑛subscript𝑉𝑛(F_{n},V_{n})( italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_V start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ), where Fn={(θn,m,ϕn,m)|1mM}subscript𝐹𝑛conditional-setsubscript𝜃𝑛𝑚subscriptitalic-ϕ𝑛𝑚1𝑚𝑀F_{n}=\{(\theta_{n,m},\phi_{n,m})\big{|}1\leq m\leq M\}italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = { ( italic_θ start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ) | 1 ≤ italic_m ≤ italic_M } and Vn={(θ˙n,m,ϕ˙n,m)|1mM}subscript𝑉𝑛conditional-setsubscript˙𝜃𝑛𝑚subscript˙italic-ϕ𝑛𝑚1𝑚𝑀V_{n}=\{(\dot{\theta}_{n,m},\dot{\phi}_{n,m})\big{|}1\leq m\leq M\}italic_V start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = { ( over˙ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT , over˙ start_ARG italic_ϕ end_ARG start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ) | 1 ≤ italic_m ≤ italic_M }. Considering the input format of DQN, we denote the keyframe extraction decision K𝐾Kitalic_K as a 𝒳={x1,x2,,xN}𝒳subscript𝑥1subscript𝑥2subscript𝑥𝑁\mathcal{X}=\{x_{1},x_{2},\dots,x_{N}\}caligraphic_X = { italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT }, which can be defined by:

(20) xn={1,if the nth frame is keyframe,0,if the nth frame is not keyframe.subscript𝑥𝑛cases1if the nth frame is keyframe0if the nth frame is not keyframex_{n}=\begin{cases}1,&\textrm{if the $n^{th}$ frame is keyframe},\\ 0,&\textrm{if the $n^{th}$ frame is not keyframe}.\end{cases}italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = { start_ROW start_CELL 1 , end_CELL start_CELL if the italic_n start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT frame is keyframe , end_CELL end_ROW start_ROW start_CELL 0 , end_CELL start_CELL if the italic_n start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT frame is not keyframe . end_CELL end_ROW

Deep Q-Learning is one of the typical algorithms of DRL, which can be used to solve the decision-making problem of a Markov Decision Process (MDP). To apply the DQL algorithm, we regard the keyframe extraction process as a MDP process. For each time step, we add one new keyframe to 𝒳𝒳\mathcal{X}caligraphic_X until we reach the desired keyframe number.

We represent si={Fi,Vi,𝒳i}superscript𝑠𝑖superscript𝐹𝑖superscript𝑉𝑖superscript𝒳𝑖s^{i}=\{F^{i},V^{i},\mathcal{X}^{i}\}italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT = { italic_F start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_V start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , caligraphic_X start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT } is the state when extracting the ithsuperscript𝑖𝑡i^{th}italic_i start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT new keyframe. To meet the prerequisite of reconstruction, we set the first frame and the last frame as two initial keyframes. For the sequence with N𝑁Nitalic_N frames, the actions space contains N𝑁Nitalic_N different actions. Specifically, the action ai=msuperscript𝑎𝑖𝑚a^{i}=mitalic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT = italic_m means at state sisuperscript𝑠𝑖s^{i}italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT, we add the mthsuperscript𝑚𝑡m^{th}italic_m start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT frame to 𝒳𝒳\mathcal{X}caligraphic_X as the new keyframe. Additionally, the keyframe is chosen from the non-keyframes.

Although all keyframes should be treated fairly and the keyframes should be extracted simultaneously, DQN has the ability to generate actions for maximizing long-term reward. As a result, the impact of converting the decision-making process into iterations can be ignored. Theoretically, the reconstructed sequence has the lowest accuracy at state s0superscript𝑠0s^{0}italic_s start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT, where all the middle frames are reconstructed from the first frame and the last frame. When all the frames are considered keyframes, the reconstructed sequence is the sequence S𝑆Sitalic_S itself. We denote the reconstruction in the ithsuperscript𝑖𝑡i^{th}italic_i start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT state as Qisuperscript𝑄𝑖Q^{i}italic_Q start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT, which can be calculated by (18). We then set a normalized relative error Risuperscript𝑅𝑖R^{i}italic_R start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT of sisuperscript𝑠𝑖s^{i}italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT as:

(21) Ri=1QiQ0superscript𝑅𝑖1superscript𝑄𝑖superscript𝑄0R^{i}=1-\frac{Q^{i}}{Q^{0}}italic_R start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT = 1 - divide start_ARG italic_Q start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT end_ARG start_ARG italic_Q start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT end_ARG

where Risuperscript𝑅𝑖R^{i}italic_R start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT is more close to one if sisuperscript𝑠𝑖s^{i}italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT achieve higher accuracy. To determine the reward risuperscript𝑟𝑖r^{i}italic_r start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT from each action, we compare the normalized relative error between and after the action aisuperscript𝑎𝑖a^{i}italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT. Then risuperscript𝑟𝑖r^{i}italic_r start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT can be denoted as:

(22) ri=Ri+1Risuperscript𝑟𝑖superscript𝑅𝑖1superscript𝑅𝑖r^{i}=R^{i+1}-R^{i}italic_r start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT = italic_R start_POSTSUPERSCRIPT italic_i + 1 end_POSTSUPERSCRIPT - italic_R start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT

Consequently, the keyframe extraction process can be regarded as one MDP process represented with state sisuperscript𝑠𝑖s^{i}italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT, action aisuperscript𝑎𝑖a^{i}italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT and reward risuperscript𝑟𝑖r^{i}italic_r start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT. Due to the infinite state space, considering the different positions and velocities of each point, traditional RL approaches are unable to handle this problem. DRL is then proposed by combining RL with DNN to tackle the infinite state space. Substituting different state data into DQN can output the Q-value of each possible action. Q-value is the value used to represent the scores for adapting each action. The higher the Q value, the higher the expected long-term reward after adopting the corresponding action. After the DQN is well-trained, we can determine the best keyframe extraction decision by choosing the action with the highest Q-value.

4.2. Train

To train the DQN, we first collect a training dataset consisting of various sequences. When the iteration begins, we input one sequence into the model from state s0superscript𝑠0s^{0}italic_s start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT. Then, in each time step for state sisuperscript𝑠𝑖s^{i}italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT, we can obtain action aisuperscript𝑎𝑖a^{i}italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT, reward risuperscript𝑟𝑖r^{i}italic_r start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT and the next state si+1superscript𝑠𝑖1s^{i+1}italic_s start_POSTSUPERSCRIPT italic_i + 1 end_POSTSUPERSCRIPT via DQN. Then (si,ai,ri,si+1)superscript𝑠𝑖superscript𝑎𝑖superscript𝑟𝑖superscript𝑠𝑖1(s^{i},a^{i},r^{i},s^{i+1})( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_s start_POSTSUPERSCRIPT italic_i + 1 end_POSTSUPERSCRIPT ) can be regarded as one training data, which is stored in the training dataset. If the data number of the training dataset has reached the upper limit, the oldest data will be replaced to improve the training performance. In addition, to better explore the possible keyframe extraction decisions, we also apply ϵitalic-ϵ\epsilonitalic_ϵ-greedy policy in the decision-making process. Within each step, a random number is generated as a determination. Specifically, aisuperscript𝑎𝑖a^{i}italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT will be chosen randomly if the determination is smaller than ϵitalic-ϵ\epsilonitalic_ϵ. Otherwise, aisuperscript𝑎𝑖a^{i}italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT will be generated using DQN as planned.

After the training dataset has got sufficient training data, the DQN will be trained every τ����\tauitalic_τ step, namely the training interval. To train the DQN during each training phase, we randomly select one batch of data from the dataset. Since the main goal of DQL is to maximize the reward of the final decision rather than focusing on the reward for each step, two DNNs with the same structure are used in one DQL model. One is the main DNN used to output the Q value of the current state, while another is the target DNN utilized to evaluate the next state decided by the main DNN. The Q-value produced by the main DNN is called Qmainsuperscript𝑄𝑚𝑎𝑖𝑛Q^{main}italic_Q start_POSTSUPERSCRIPT italic_m italic_a italic_i italic_n end_POSTSUPERSCRIPT, whereas the Q-value of the target DNN is denoted as Qtargsuperscript𝑄𝑡𝑎𝑟𝑔Q^{targ}italic_Q start_POSTSUPERSCRIPT italic_t italic_a italic_r italic_g end_POSTSUPERSCRIPT. For training data (si,ai,ri,si+1)superscript𝑠𝑖superscript𝑎𝑖superscript𝑟𝑖superscript𝑠𝑖1(s^{i},a^{i},r^{i},s^{i+1})( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_s start_POSTSUPERSCRIPT italic_i + 1 end_POSTSUPERSCRIPT ), the expected output of Qmain(si,ai)superscript𝑄𝑚𝑎𝑖𝑛superscript𝑠𝑖superscript𝑎𝑖Q^{main}(s^{i},a^{i})italic_Q start_POSTSUPERSCRIPT italic_m italic_a italic_i italic_n end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) can be given as:

(23) ri+γmaxa(Qtarg(si+1,a))superscript𝑟𝑖𝛾subscript𝑚𝑎𝑥superscript𝑎superscript𝑄𝑡𝑎𝑟𝑔superscript𝑠𝑖1𝑎r^{i}+\gamma\mathop{max}\limits_{a^{\prime}}(Q^{targ}(s^{i+1},a’))italic_r start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT + italic_γ start_BIGOP italic_m italic_a italic_x end_BIGOP start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_Q start_POSTSUPERSCRIPT italic_t italic_a italic_r italic_g end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i + 1 end_POSTSUPERSCRIPT , italic_a ’ ) )

where γ𝛾\gammaitalic_γ is the discount factor that balances the short view and the future reward. For all the sequences in the batch, we apply the Huber function as the loss function to calculate the loss between the output Q-value and the expected Q-value, which is then used to update the parameter of the main network.

After the model is trained, the performance of Qmainsuperscript𝑄𝑚𝑎𝑖𝑛Q^{main}italic_Q start_POSTSUPERSCRIPT italic_m italic_a italic_i italic_n end_POSTSUPERSCRIPT becomes better for extracting a new keyframe. The ability of Qtargsuperscript𝑄𝑡𝑎𝑟𝑔Q^{targ}italic_Q start_POSTSUPERSCRIPT italic_t italic_a italic_r italic_g end_POSTSUPERSCRIPT then limits the training efficiency and accuracy of the model. Consequently, after π𝜋\piitalic_π epochs of training, we renew the network parameter of Qtargsuperscript𝑄𝑡𝑎𝑟𝑔Q^{targ}italic_Q start_POSTSUPERSCRIPT italic_t italic_a italic_r italic_g end_POSTSUPERSCRIPT using the trained parameter of Qmainsuperscript𝑄𝑚𝑎𝑖𝑛Q^{main}italic_Q start_POSTSUPERSCRIPT italic_m italic_a italic_i italic_n end_POSTSUPERSCRIPT. The whole model can be trained with stability and efficiency as the iteration continues. The process is repeated until the model converges. The pseudo-code of the proposed SIDQL algorithm is provided in Algorithm 1.

Algorithm 1 SIDQL Algorithm
1:Training dataset 𝒮𝒮\mathcal{S}caligraphic_S, sequence Ssuperscript𝑆S^{*}italic_S start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT and keyframe number W
2:Keyframe extraction decision K
3:Initialization: Initialize the main network Qmainsuperscript𝑄𝑚𝑎𝑖𝑛Q^{main}italic_Q start_POSTSUPERSCRIPT italic_m italic_a italic_i italic_n end_POSTSUPERSCRIPT, the target network Qtargsuperscript𝑄𝑡𝑎𝑟𝑔Q^{targ}italic_Q start_POSTSUPERSCRIPT italic_t italic_a italic_r italic_g end_POSTSUPERSCRIPT and the replay memory
4:Training steps j = 0
5:for  epochs i \leqdo
6:     Choose a sequence Sisuperscript𝑆𝑖S^{i}italic_S start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT from 𝒮𝒮\mathcal{S}caligraphic_S
7:     Initialize the keyframe extraction decision K0subscript𝐾0K_{0}italic_K start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT
8:     for w \leqdo
9:         Regard Sisuperscript𝑆𝑖S^{i}italic_S start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT and current decision Kwsubscript𝐾𝑤K_{w}italic_K start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT as state sisuperscript𝑠𝑖s^{i}italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT
10:         j = j + 1
11:         Generate a random number p within [0,1]
12:         if p \leq ϵitalic-ϵ\epsilonitalic_ϵ  then
13:              Randomly choose action aisuperscript𝑎𝑖a^{i}italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT
14:         else
15:              Set aisuperscript𝑎𝑖a^{i}italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT=argmaxaisubscript𝑚𝑎𝑥superscript𝑎𝑖\mathop{max}\limits_{a^{i}}start_BIGOP italic_m italic_a italic_x end_BIGOP start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT end_POSTSUBSCRIPT Qmain(si,ai)superscript𝑄𝑚𝑎𝑖𝑛superscript𝑠𝑖superscript𝑎𝑖Q^{main}(s^{i},a^{i})italic_Q start_POSTSUPERSCRIPT italic_m italic_a italic_i italic_n end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT )
16:         end if
17:         Calculate the reward risuperscript𝑟𝑖r^{i}italic_r start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT using (22)
18:         Renew decision Kwsubscript𝐾𝑤K_{w}italic_K start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT according to aisuperscript𝑎𝑖a^{i}italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT
19:         if database is not full then
20:              Store (si,ai,ri,si+1)superscript𝑠𝑖superscript𝑎𝑖superscript𝑟𝑖superscript𝑠𝑖1(s^{i},a^{i},r^{i},s^{i+1})( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_s start_POSTSUPERSCRIPT italic_i + 1 end_POSTSUPERSCRIPT ) in the replay memory
21:         else
22:              Replace the oldest data with (si,ai,ri,si+1)superscript���𝑖superscript𝑎𝑖superscript𝑟𝑖superscript𝑠𝑖1(s^{i},a^{i},r^{i},s^{i+1})( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_r start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_s start_POSTSUPERSCRIPT italic_i + 1 end_POSTSUPERSCRIPT )
23:         end if
24:         if  mod(j,τ𝜏\tauitalic_τ)==0 then
25:              Randomly choose one batch of training data
26:              Train Qmainsuperscript𝑄𝑚𝑎𝑖𝑛Q^{main}italic_Q start_POSTSUPERSCRIPT italic_m italic_a italic_i italic_n end_POSTSUPERSCRIPT using the Adam optimizer
27:         end if
28:         if  mod(j, τπ𝜏𝜋\tau\piitalic_τ italic_π)==0 then
29:              Renew the parameter of Qtargsuperscript𝑄𝑡𝑎𝑟𝑔Q^{targ}italic_Q start_POSTSUPERSCRIPT italic_t italic_a italic_r italic_g end_POSTSUPERSCRIPT
30:         end if
31:     end for
32:end for
33:Regard Ssuperscript𝑆S^{*}italic_S start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT and decision K0subscript𝐾0K_{0}italic_K start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT as state s0superscript𝑠0s^{0}italic_s start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT
34:for w \leqdo
35:     Set aisuperscript𝑎𝑖a^{i}italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT=argmaxaisubscript𝑚𝑎𝑥superscript𝑎𝑖\mathop{max}\limits_{a^{i}}start_BIGOP italic_m italic_a italic_x end_BIGOP start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT end_POSTSUBSCRIPT Qmain(si,ai)superscript𝑄𝑚𝑎𝑖𝑛superscript𝑠𝑖superscript𝑎𝑖Q^{main}(s^{i},a^{i})italic_Q start_POSTSUPERSCRIPT italic_m italic_a italic_i italic_n end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT )
36:     Renew decision Kwsubscript𝐾𝑤K_{w}italic_K start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT according to aisuperscript𝑎𝑖a^{i}italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT
37:end for
38:return Keyframe decisions K of Ssuperscript𝑆S^{*}italic_S start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT

4.3. Test

To verify the performance of the proposed algorithm, we first need to ensure that the model has converged after iterations. The loss values during each training epoch can be utilized to test the convergence performance. More importantly, we also need to evaluate the ability to solve the problem after the DQN model is converged.

The error between the generated and optimal decisions is hard to measure because it is hard to obtain the optimal solutions for the proposed keyframe extraction problem (𝒫𝒫\mathcal{P}caligraphic_P). So, we choose several algorithms as baselines and compare analyses on the same test dataset. We assume that T𝑇Titalic_T different sequences are contained in the test dataset 𝒯𝒯\mathcal{T}caligraphic_T, which can be denoted as 𝒯={St|1tT}𝒯conditional-setsubscriptsuperscript𝑆𝑡1𝑡𝑇\mathcal{T}=\{S^{*}_{t}\big{|}1\leq t\leq T\}caligraphic_T = { italic_S start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | 1 ≤ italic_t ≤ italic_T }. Then we define the mean angle error \mathcal{R}caligraphic_R as the evaluation criterion, given as:

(24) =1Tt=1TQ(St,Kt)1𝑇subscriptsuperscript𝑇𝑡1𝑄subscriptsuperscript𝑆𝑡subscriptsuperscript𝐾𝑡\mathcal{R}=\frac{1}{T}\sum^{T}_{t=1}Q(S^{*}_{t},K^{*}_{t})caligraphic_R = divide start_ARG 1 end_ARG start_ARG italic_T end_ARG ∑ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT italic_Q ( italic_S start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_K start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )

where Ktsubscriptsuperscript𝐾𝑡K^{*}_{t}italic_K start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is the keyframe extraction decision of sequence Stsubscriptsuperscript𝑆𝑡S^{*}_{t}italic_S start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT generated by different approaches.

With the mean angle error function (24), we can first evaluate the influence of different hyperparameter settings. The best settings can be obtained by choosing the network with the lowest error. Also, the performance under different application scenarios should be considered. The mean angle error of the proposed approach and the baselines when extracting different numbers of keyframes will be further calculated to give us more insight into setting the keyframe numbers accordingly.

5. EXPERIMENTS

In this section, we set up simulations to illustrate that our scheme can effectively decrease the amount of data and transmission latency, while still maintaining a high level of accuracy in reconstruction. Experiments are designed by using the motion capture data in mixed categories.

5.1. Dataset

We conducted several experiments on the CMU Graphics Lab Motion Capture Database 111http://mocap.cs.cmu.edu, which is widely used in the related work of motion capture. It encompasses a diverse range of human motion data, amounting to approximately 10 hours of motion in total. In detail, the database contains 2605 sequences of motions in 6 categories and 23 subcategories by 111 human subjects, encompassing various activities such as locomotion, physical activities and interaction. By utilizing the database, we can access a large and diverse set of motions to validate the performance of the SIDQL algorithm.

5.2. Experiment Settings

In our experiments, we focus on utilizing 23 major joints of the human skeleton by excluding the unused joints from fingers, thumbs and toes to extract the keyframes. For each motion sequence, the frame rate reduction is applied from the initial 120 frames per second (FPS) to 30 FPS, so the time interval between two adjacent frames is 0.33s. Then, we crop the motion sequences into fixed-length motion subsequences of 60 frames each, which served as input for training the network. Subsequently, the database is divided into two subsets, where 80%percent\%% of the sequences is the training set, and the rest 20%percent\%% is the test set.

During the training of DQN, the agent interacts with the environment by selecting actions based on the current Q-values and receives a reward for each action taken. The parameters in the main network are updated every 200 steps, while the weights of the Q-network are updated every 100 epochs. In addition, the discount factor of the reward function is set as γ𝛾\gammaitalic_γ = 0.5.

TABLE 2. Evaluation Parameter
Parameters Values
The number of frames in a sequence N = 60.
The number of points in a frame M = 23.
The frames per second of the sequence FPS = 30.
The time interval between two adjacent frames ΔΔ\Deltaroman_Δ = 0.33s.
The training interval of DQN τ𝜏\tauitalic_τ = 200 steps.
The renew interval of Qtargsuperscript𝑄𝑡𝑎𝑟𝑔Q^{targ}italic_Q start_POSTSUPERSCRIPT italic_t italic_a italic_r italic_g end_POSTSUPERSCRIPT π𝜋\piitalic_π = 100 epochs.
The discount factor of the reward function γ𝛾\gammaitalic_γ = 0.5.

For the DQN in our algorithm, we consider the main network and Q-network with two hidden layers as the decision-making module and a fully connected network to extract the features of the keyframes. We implement our algorithm in Python 3.9.12 with TensorFlow 3.6.13. All the simulations are performed based on an Intel Core i9-12900H CPU and 32.0 GB memory.

5.3. Training Evaluation

In this part, we evaluate the performance of the SIDQL algorithm by measuring the mean error and loss value under different learning rates, training intervals, memory sizes, and batch sizes.

5.3.1. Impact of Learning Rate


Learning rate plays a vital role as a hyperparameter that significantly impacts the performance of our approach. In DQN-based models, the learning rate has a crucial influence. When the learning rate is too low, the model converges slowly, requiring more epochs for training or getting stuck in suboptimal solutions. Conversely, if the learning rate is too large, the model converges quickly but may fail to achieve high accuracy due to overshooting or instability.

Refer to caption

Figure 3. Simulation result of learning rate. Left: Loss performance of learning rate. Right: Mean angle error of learning rate.

During our experiments, the learning rate is adjusted from 0.1 to 0.0001, and the loss performance can be shown in the left image of Fig. 3. Since our algorithm is initialized with random weights, the predictions may be far from the target values at the beginning of training, resulting in a high loss. So, when plotting the figure, the logarithm base 10 is applied to the loss values to better visualize and interpret the data. It is observed that the loss generally converges to a stable value regardless of the learning rate used. The changing speed of the loss values becomes faster at the beginning and then becomes stable when the algorithm starts to learn and adjust its parameters. When our algorithm updates the weights of the network, the loss decreases significantly. Moreover, a learning rate of 0.01 can lead to a stable and effective training process with minimized loss. An overly high learning rate may result in fluctuating loss values, indicating an unstable training process. Conversely, a learning rate that is too low may lead to slow convergence or getting stuck in suboptimal solutions, which hinders the model’s capacity to extract accurate and representative keyframes.

Indeed, the loss values can affect the quality of the reconstructed motion. Fig. 3 (right-hand side) shows the mean angle error under different learning rates. When the learning rate is set to 0.01, it yields the minimum loss value and the lowest error, suggesting that the algorithm achieves the best reconstruction performance in minimizing the discrepancy between the predicted motions and the ground truth.

5.3.2. Impact of Training Interval


Training interval in our algorithm refers to the frequency at which the main network updates its parameters through training. It determines how often the model learns from its experiences and adjusts its action-selection strategy. By setting an appropriate training interval, we can balance efficiency and stability in our algorithm.

Refer to caption

Figure 4. Mean angle error of training interval.

Training interval determines how often the model parameters are updated. Still, it does not directly influence the calculation of the loss value, so we do not analyze the loss performance under different training intervals. Fig. 4 shows the mean angle error of different training intervals in motion reconstruction. It is important to note that training intervals cannot be too large or too small. On the one hand, if the training interval is too small, meaning that the model is updated frequently, it can lead to overfitting. On the other hand, if the training interval is too large, the model updates infrequently, which can slow down the learning process, resulting in a large computational cost in the keyframe extraction process. Consequently, we set the training interval to 100 to avoid overfitting and more training time to extract the keyframes.

5.3.3. Impact of Memory Size


Memory size, often referred to as the replay buffer size, is an important factor in DQN-based models due to its impact on the learning process and the model’s ability to effectively utilize past experiences. For traditional DQN-based models, if the memory size is too large, the model may spend more time replaying redundant or less informative experiences, leading to slower convergence and inefficient learning. On the contrary, the model is more likely to sample experiences that are temporally correlated or biased towards recent data, compromising the model’s performance.

Refer to caption

Figure 5. Simulation result of memory size. Left: Loss performance of memory size. Right: Mean angle error of memory size.

During our experiments, memory size is adjusted from 1000 to 50000, and each model is trained with 5000 training steps, as shown in the left image in Fig. 5. Obviously, irrespective of the memory size used, the loss typically converges to a stable value during the training process. When memory size is relatively small, the loss exhibits significant fluctuations, potentially affecting the accuracy and reliability of the extracted keyframes. On the contrary, with a large memory size, the model introduces computational inefficiencies and redundant data. So, striking the right balance in memory size is crucial for achieving accurate and efficient keyframe extraction in motion capture. When the memory size is set to 10,000, the loss reaches its minimum and remains stable. Additionally, we can notice that the loss tends to stabilize after around 1600 training iterations. This stabilization indicates that the model has reached a relatively stable state in terms of its learning progress, which can positively impact the accuracy and reliability of keyframe extraction.

The right image of Fig. 5 presents the mean angle error across various memory sizes, providing insights into the impact of memory size on the accuracy of the motion reconstruction process. It is evident that the mean angle error is minimized when the memory size is set to 10,000, indicating that the keyframe extraction decisions are optimal at this memory size. Therefore, selecting an appropriate memory size is crucial for achieving superior results in motion capture applications.

5.3.4. Impact of Training Batch Size


Batch size is another hyperparameter that can be tuned to improve the performance of DQN-based models, which is the number of experiences (i.e., state, action, reward, next state) that are fed into the network at once during training. If the batch size is too small, the gradient estimates used to update the weights of the network will be noisy and may not accurately reflect the true gradient of the loss function, resulting in slow convergence, poor performance and instability in the learning process. On the other hand, if the batch size is too large, the network may become overfitted. It will influence the generalization ability of the model.

Refer to caption

Figure 6. Simulation result of batch size. Left: Loss performance of batch size. Right: Mean angle error of batch size.

The left image of Fig. 6 illustrates the loss performance obtained during our experiments, where we vary the batch size from 64 to 1024. Choosing an appropriate batch size that balances the trade-off between stability and efficiency in our algorithm is important. The minimum loss is achieved and maintained by setting the batch size to 256. It is worth noticing that when the batch size is small, the loss shows significant fluctuations, and the fluctuation is also significantly larger than the big batch size after the loss convergence. That is because when the batch size is too small, our algorithm may not be able to effectively learn the features of the input motion data, so the extracted keyframes cannot accurately represent the sequence. But when the batch size is too large, it can lead to overfitting, also resulting in poor performance in keyframe extraction. In addition, no matter what batch size is used, the loss usually converges to a stable value during the training process.

The mean angle error across various batch sizes is shown in the right image in Fig. 6. Obviously, the mean angle error is minimized when the batch size is set to 256, showing that our algorithm can successfully learn the input motion data’s features at this batch size and prevent overfitting. Additionally, the mean angle error is nearly identical when the batch size is 512 and 1024. This can be attributed to the fact that a larger batch size can cause slower convergence or overfitting, which can adversely affect the performance of motion reconstruction after selecting the keyframes.

5.4. Performance Comparison

By evaluating the mean error and loss performance under different hyperparameters, we have proven that our SIDQL algorithm can be trained without labeled data and effectively extract the keyframes. To better understand our algorithm’s performance, we choose several traditional algorithms as baselines. We then changed the number of keyframes from 5 to 15 and tested the mean error of the methods. The following methods are tested for comparison analysis:

  • Random Choice (RC): In this method, we first select the first and last frames of the motion sequence as keyframes, then randomly select a certain number of keyframes from the other frames. Each frame has an equal probability of being selected as a keyframe.

  • Uniform Choice (UC): In this method, the first and last frames of the motion sequence are selected as keyframes. We then uniformly select frames from the sequence as keyframes.

  • Greedy: This method involves setting an original keyframe set that includes the first and last frames of a given motion sequence. For each frame in a motion sequence, we set the current frame and other keyframes in the original keyframe set as the new keyframe set, then apply the reconstruction algorithm and calculate the mean error of the new keyframe set. Subsequently, we select the frame with the lowest mean error as the next keyframe, and add it to the original keyframe set. The iteration is repeated the desired number of keyframes is selected. Since the Greedy algorithm explores multiple possibilities for extracting keyframes, it achieves high accuracy. Hence, it is defined as the skyline.

  • Our algorithm: In this method, we applied the proposed SIDQL algorithm to extract the keyframes.

TABLE 3. Mean Error under different number of keyframes
Method 5 keyframes 10 keyframes 15 keyframes
RC 0.1437 0.0833 0.0549
UC 0.0945 0.0490 0.0328
Greedy 0.0536 0.0311 0.0200
Ours 0.0844 0.0443 0.0297

Refer to caption

Figure 7. The reconstructed motion by different methods when extracted 5 keyframes. Left: Jump. Middle: Run. Right: Pick up something.

The comparison results of different baselines are depicted in Table 3. In comparison between the RC and UC methods, our method extracts better keyframes with less reconstruction error in different numbers of keyframes. In the RC method, randomly selecting keyframes may lead to the omission of crucial keyframes, which can negatively impact the accuracy of motion reconstruction. For the UC method, when the motion changes rapidly, the reconstructed motion sequence is not coherent, which affects the user experience. Our method selects keyframes related to the motion content, improving the accuracy of keyframe extraction and reducing errors. However, our results are slightly inferior to the Greedy method. It can be attributed to the fact that the Greedy algorithm explores multiple possibilities for extracting keyframes, while our method does not consider all possible combinations. Although our method’s accuracy is not as good as the Greedy method, when extracting 5 keyframes for a motion sequence, the Greedy method takes about 0.9s to generate keyframe decisions, while we can generate decisions within 0.004s. Therefore, compared to the Greedy method, our method generates decisions in a very short time, which can meet the low-latency requirements for avatar motion synchronization in the Metaverse.

We visualize the reconstruction results obtained by extracting 5 keyframes. From these animated characters, it is obvious that the reconstructed frames by our SIDQL algorithm and Greedy method are very similar to the ground truth frames, while the reconstructed frames by the RC and the UC are more or less unnatural and incoherent. In the left image of Fig. 7, the motion sequence about jumping is from subject 01_02. Obviously, some key joints, such as the knee, leg and elbow, move widely, and the reconstructed motion by the RC method has a significant deviation, compared to the original motion. The motion sequences reconstructed by UC, Greedy, and our method are almost indistinguishable from the ground truth sequence. When the motion is fast and wide, the differences between the motion sequences reconstructed by different methods become more apparent. For instance, the middle image in Fig. 7 displays the running motion on subject 104_122, which is a relatively fierce motion. The motions reconstructed by the RC and the UC are not performed well. For the RC method, the motion of the arms and legs is highly uncoordinated, and there is a significant positional deviation compared to the original motion. Unlike the ground truth, the UC method’s body swings as it runs and has unnatural leg movements. In contrast, the motion sequences reconstructed by the Greedy method and our method have higher accuracy, and are similar to the original motion, which indicates that the keyframes selected by these two methods are representative of the motion sequence. However, due to the fast and large movements during running, there are brief moments of unnatural motion in the foot movement of our method and the arm movement of the Greedy method. In addition, when the motion speed is slow, the differences between the motion sequences reconstructed by different methods are small, and the motion is natural and coherent, as shown in the right image of Fig. 7.

6. CONCLUSION AND FUTURE WORK

In this study, we initially developed a novel motion reconstruction algorithm predicated on spherical interpolation techniques. This algorithm transforms the positional and velocity data of each point into spherical coordinates, ensuring the constancy of bone length. Building upon this innovative reconstruction method, we formulated the keyframe extraction problem as an optimization task aimed at minimizing the mean reconstruction error. Subsequently, we introduced the SIDQL framework, which incorporates a specialized reward function derived from the mean error metric. This framework is adept at learning from mixed-category motion sequences without the necessity for labeled keyframes. We then assessed the reconstruction efficacy of our proposed algorithm by comparing it with several baseline methods across varying number of keyframes. Experimental outcomes attest to our algorithm’s capability to considerably curtail data volume while preserving high reconstruction fidelity, fulfilling the stringent low-latency demands of the Metaverse.

The principal contribution of this work is the proposition and validation of the SIDQL algorithm for keyframe extraction and motion reconstruction within the realm of motion capture. The algorithm we propose can be seamlessly integrated with edge computing and federated learning paradigms to further diminish transmission latency and enrich the user experience. To more efficaciously tackle real-world applications, it is imperative to conduct a comparative performance evaluation of deep learning and reinforcement learning methodologies within the same framework to identify the most efficacious strategy. Additionally, the application of meta-learning strategies can be explored to enhance the model’s adaptability across diverse scenarios. Looking ahead, we aim to deploy the algorithm in tangible real-world settings.

References

  • (1)
  • Bulut and Capin (2007) Eyuphan Bulut and Tolga Capin. 2007. Key frame extraction from motion capture data by curve saliency. In Computer animation and social agents, Vol. 20.
  • Chang et al. (2016) Xiaojing Chang, Pengfei Yi, and Qiang Zhang. 2016. Key frames extraction from human motion capture data based on hybrid particle swarm optimization algorithm. Recent Developments in Intelligent Information and Database Systems (2016), 335–342.
  • Chen et al. (2023) Xiaowei Chen, Xiao Jiang, Lishuang Zhan, Shihui Guo, Qunsheng Ruan, Guoliang Luo, Minghong Liao, and Yipeng Qin. 2023. Full-body human motion reconstruction with sparse joint tracking using flexible sensors. ACM Transactions on Multimedia Computing, Communications and Applications 20, 2, 1–19.
  • Després and Jia (2017) Philippe Després and Xun Jia. 2017. A review of GPU-based medical image reconstruction. Physica Medica 42 (2017), 76–92.
  • Elhamifar et al. (2012) Ehsan Elhamifar, Guillermo Sapiro, and Rene Vidal. 2012. See all by looking at a few: Sparse modeling for finding representative objects. In 2012 IEEE conference on computer vision and pattern recognition. IEEE, 1600–1607.
  • Guru et al. (2018) DS Guru, VK Jyothi, and YH Sharath Kumar. 2018. Cluster based approaches for keyframe selection in natural flower videos. In Intelligent Systems Design and Applications: 17th International Conference on Intelligent Systems Design and Applications (ISDA 2017) held in Delhi, India, December 14-16, 2017. Springer, 474–484.
  • Halici and Bostanci (2019) Egemen Halici and Erkan Bostanci. 2019. Evaluating the Use of Interpolation Methods for Human Body Motion Modelling. International Journal of Computer Theory and Engineering 10, 1 (2019), 25–29.
  • Halit and Capin (2011) Cihan Halit and Tolga Capin. 2011. Multiscale motion saliency for keyframe extraction from motion capture sequences. Computer Animation and Virtual Worlds 22, 1 (2011), 3–14.
  • Howarth and Callaghan (2010) Samuel J Howarth and Jack P Callaghan. 2010. Quantitative assessment of the accuracy for three interpolation techniques in kinematic analysis of human movement. Computer methods in biomechanics and biomedical engineering 13, 6 (2010), 847–855.
  • Huang et al. (2005) Ke-Sen Huang, Chun-Fa Chang, Yu-Yao Hsu, and Shi-Nine Yang. 2005. Key probe: a technique for animation keyframe extraction. The Visual Computer 21 (2005), 532–541.
  • Huang et al. (2022) Ruobing Huang, Qilong Ying, Zehui Lin, Zijie Zheng, Long Tan, Guoxue Tang, Qi Zhang, Man Luo, Xiuwen Yi, Pan Liu, et al. 2022. Extracting keyframes of breast ultrasound video using deep reinforcement learning. Medical Image Analysis 80 (2022), 102490.
  • Jadon and Jasim (2020) Shruti Jadon and Mahmood Jasim. 2020. Unsupervised video summarization framework using keyframe extraction and video skimming. In 2020 IEEE 5th International Conference on computing communication and automation (ICCCA). IEEE, 140–145.
  • Kim et al. (2021) Seong Uk Kim, Hanyoung Jang, Hyeonseung Im, and Jongmin Kim. 2021. Human motion reconstruction using deep transformer networks. Pattern Recognition Letters 150 (2021), 162–169.
  • Lam et al. (2022) Kit Yung Lam, Liang Yang, Ahmad Alhilal, Lik-Hang Lee, Gareth Tyson, and Pan Hui. 2022. Human-Avatar Interaction in Metaverse: Framework for Full-Body Interaction. In Proceedings of the 4th ACM International Conference on Multimedia in Asia. 1–7.
  • Lee et al. (2021) Lik-Hang Lee, Tristan Braud, Pengyuan Zhou, Lin Wang, Dianlei Xu, Zijun Lin, Abhishek Kumar, Carlos Bermejo, and Pan Hui. 2021. All one needs to know about metaverse: A complete survey on technological singularity, virtual ecosystem, and research agenda. arXiv preprint arXiv:2110.05352 (2021).
  • Li et al. (2020) Miaopeng Li, Zimeng Zhou, and Xinguo Liu. 2020. Cross refinement techniques for markerless human¡? brk?¿ motion capture. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 16, 1, 1–18.
  • Lin et al. (2018) Ting-Lan Lin, Hua-Wei Tseng, Yangming Wen, Fu-Wei Lai, Ching-Hsuan Lin, and Chuan-Jia Wang. 2018. Reconstruction algorithm for lost frame of multiview videos in wireless multimedia sensor network based on deep learning multilayer perceptron regression. IEEE Sensors Journal 18, 23 (2018), 9792–9801.
  • Liu et al. (2013) Xian-mei Liu, Ai-min Hao, and Dan Zhao. 2013. Optimization-based key frame extraction for motion capture animation. The visual computer 29 (2013), 85–95.
  • Ly et al. (2018) Son Thai Ly, Guee-Sang Lee, Soo-Hyung Kim, and Hyung-Jeong Yang. 2018. Emotion recognition via body gesture: Deep learning model coupled with keyframe selection. In Proceedings of the 2018 international conference on machine learning and machine intelligence. 27–31.
  • Mo et al. (2021) Clinton Mo, Kun Hu, Shaohui Mei, Zebin Chen, and Zhiyong Wang. 2021. Keyframe extraction from motion capture sequences with graph based deep reinforcement learning. In Proceedings of the 29th ACM International Conference on Multimedia. 5194–5202.
  • Pollefeys et al. (2008) Marc Pollefeys, David Nistér, J-M Frahm, Amir Akbarzadeh, Philippos Mordohai, Brian Clipp, Chris Engels, David Gallup, S-J Kim, Paul Merrell, et al. 2008. Detailed real-time urban 3d reconstruction from video. International Journal of Computer Vision 78 (2008), 143–167.
  • Prabakaran et al. (2022) T Prabakaran, Loveleen Kumar, S Ashabharathi, S Prabhavathi, Maneesh Vilas Deshpande, and Mochammad Fahlevi. 2022. Key Frame Extraction Analysis Based on Optimized Convolution Neural Network (OCNN) using Intensity Feature Selection (IFS). In 2022 2nd International Conference on Technological Advancements in Computational Sciences (ICTACS). IEEE, 842–847.
  • Pu et al. (2021) Bin Pu, Kenli Li, Shengli Li, and Ningbo Zhu. 2021. Automatic fetal ultrasound standard plane recognition based on deep learning and IIoT. IEEE Transactions on Industrial Informatics 17, 11 (2021), 7771–7780.
  • Rebecq et al. (2019a) Henri Rebecq, René Ranftl, Vladlen Koltun, and Davide Scaramuzza. 2019a. Events-to-video: Bringing modern computer vision to event cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3857–3866.
  • Rebecq et al. (2019b) Henri Rebecq, René Ranftl, Vladlen Koltun, and Davide Scaramuzza. 2019b. High speed and high dynamic range video with an event camera. IEEE transactions on pattern analysis and machine intelligence 43, 6 (2019), 1964–1980.
  • Savran Kızıltepe et al. (2021) Rukiye Savran Kızıltepe, John Q Gan, and Juan José Escobar. 2021. A novel keyframe extraction method for video classification using deep neural networks. Neural Computing and Applications (2021), 1–12.
  • Sujatha and Mudenagudi (2011) C Sujatha and Uma Mudenagudi. 2011. A study on keyframe extraction methods for video summary. In 2011 International Conference on Computational Intelligence and Communication Networks. IEEE, 73–77.
  • Sun et al. (2018) Bin Sun, Dehui Kong, Shaofan Wang, and Jinghua Li. 2018. Keyframe extraction for human motion capture data based on affinity propagation. In 2018 IEEE 9th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON). IEEE, 107–112.
  • Vlahovic et al. ([n. d.]) Sara Vlahovic, Ivan Slivar, Matko Silic, Lea Skorin-Kapov, and MirkoSuznjevic. [n. d.]. Exploring the Facets of the Multiplayer VR Gaming Experience. ACM Transactions on Multimedia Computing, Communications and Applications.
  • Wang et al. (2019) Lin Wang, Yo-Sung Ho, Kuk-Jin Yoon, et al. 2019. Event-based high dynamic range image and very high frame rate video generation using conditional generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10081–10090.
  • Wang et al. (2011) Zheshen Wang, Mrityunjay Kumar, Jiebo Luo, and Baoxin Li. 2011. Extracting key frames from consumer videos using bi-layer group sparsity. In Proceedings of the 19th ACM international conference on Multimedia. 1505–1508.
  • Weng et al. (2021) Wenming Weng, Yueyi Zhang, and Zhiwei Xiong. 2021. Event-based video reconstruction using transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2563–2572.
  • Xia et al. (2019) Guiyu Xia, Huaijiang Sun, Qingshan Liu, and Renlong Hang. 2019. Learning-based sphere nonlinear interpolation for motion synthesis. IEEE Transactions on Industrial Informatics 15, 5 (2019), 2927–2937.
  • Xia et al. (2016) Guiyu Xia, Huaijiang Sun, Xiaoqing Niu, Guoqing Zhang, and Lei Feng. 2016. Keyframe extraction for human motion capture data based on joint kernel sparse representation. IEEE Transactions on Industrial Electronics 64, 2 (2016), 1589–1599.
  • Xia et al. (2022) Guiyu Xia, Peng Xue, Huaijiang Sun, Yubao Sun, Du Zhang, and Qingshan Liu. 2022. Local Self-Expression Subspace Learning Network for Motion Capture Data. IEEE Transactions on Image Processing 31 (2022), 4869–4883.
  • Yan and Woźniak (2022) Gong Yan and Marcin Woźniak. 2022. Accurate key frame extraction algorithm of video action for aerobics online teaching. Mobile networks and applications 27, 3 (2022), 1252–1261.
  • Yu et al. (2020) Lei Yu, Wen Yang, et al. 2020. Event-based high frame-rate video reconstruction with a novel cycle-event network. In 2020 IEEE International Conference on Image Processing (ICIP). IEEE, 86–90.
  • Zhang et al. (2014a) Qiang Zhang, Xiang Xue, Dongsheng Zhou, and Xiaopeng Wei. 2014a. Motion Key-frames extraction based on amplitude of distance characteristic curve. International journal of computational intelligence systems 7, 3 (2014), 506–514.
  • Zhang et al. (2013) Qiang Zhang, Shao-Pei Yu, Dong-Sheng Zhou, and Xiao-Peng Wei. 2013. An efficient method of key-frame extraction based on a cluster algorithm. Journal of human kinetics 39 (2013), 5.
  • Zhang et al. (2014b) Qiang Zhang, Shulu Zhang, and Dongsheng Zhou. 2014b. Keyframe extraction from human motion capture data based on a multiple population genetic algorithm. Symmetry 6, 4 (2014), 926–937.
  • Zhao et al. (2022) Yuheng Zhao, Jinjing Jiang, Yi Chen, Richen Liu, Yalong Yang, Xiangyang Xue, and Siming Chen. 2022. Metaverse: Perspectives from graphics, interactions and visualization. Visual Informatics (2022).