neuROSym: Deployment and Evaluation of a ROS-based Neuro-Symbolic Model for Human Motion Prediction

Sariah Mghames1, Luca Castri1, Marc Hanheide1, Nicola Bellotto1,2 1School of Computer Science, University of Lincoln, UK.
2Dept. of Information Engineering, University of Padua, Italy.
This project has received funding from the EU’s Horizon 2020 Research and Innovation programme under grant agreement No 101017274
Abstract

Autonomous mobile robots can rely on several human motion detection and prediction systems for safe and efficient navigation in human environments, but the underline model architectures can have different impacts on the trustworthiness of the robot in the real world. Among existing solutions for context-aware human motion prediction, some approaches have shown the benefit of integrating symbolic knowledge with state-of-the-art neural networks. In particular, a recent neuro-symbolic architecture (NeuroSyM) has successfully embedded context with a Qualitative Trajectory Calculus (QTC) for spatial interactions representation. This work achieved better performance than neural-only baseline architectures on offline datasets. In this paper, we extend the original architecture to provide neuROSym, a ROS package for robot deployment in real-world scenarios, which can run, visualise, and evaluate previous neural-only and neuro-symbolic models for motion prediction online. We evaluated these models, NeuroSyM and a baseline SGAN, on a TIAGo robot in two scenarios with different human motion patterns. We assessed accuracy and runtime performance of the prediction models, showing a general improvement in case our neuro-symbolic architecture is used. We make the neuROSym package111https://github.com/sariahmghames/neuROSym publicly available to the robotics community.

I Introduction

The integration of autonomous mobile robots in logistics, transportation, and healthcare, is rapidly increasing, as is user trust in the technologies used to build them. One key requirement for mobile robots’ trustworthiness is the ability to navigate safely among humans, in addition to other capabilities such as intelligent interaction and successful task completion. Safe navigation usually requires the detection and prediction of human motion. This is necessary not only for autonomous navigation, but also for action and intent recognition, anomaly detection, and other tasks. Existing methods for human motion detection and prediction can be divided into context-agnostic and (static or dynamic) context-aware. Taking context into account (e.g. knowing whether the robot is in a warehouse or a supermarket) can have a significant impact on the accuracy of the motion prediction system.

Refer to caption
Figure 1: Deployment of neuROSym for online and context-aware human motion prediction, with real-time visualisation. The three blocks are ROS nodes, while the filled arrows and the dashed ones represent the online and offline inference, respectively. Each arrows label indicates the type of messages published and/or subscribed to by each node.

Context reasoning can include human-human and/or human-objects spatial interactions. The latter can be described by quantitative or qualitative representations. In the quantitative approach, interactions are typically embeddings of absolute or relative agents pose in a neural network model. The authors in [1] have jointly modeled human-robot and human-human interactions in a deep reinforcement learning framework for mobile robot navigation. In [2], instead, the authors learn an optimal local trajectory from a global plan by fusing human trajectories, LiDAR features, global path, and odometry features in particular attention layer. Context-awareness methods have also been proposed to deal with the challenges faced by the long-term prediction of single human motion [3, 4, 5, 6, 7, 8].

Qualitative representations of spatial interactions, though intuitive and computationally efficient, are less explored for context-aware human motion prediction. Recently, a new approach has been proposed that exploits a qualitative representation of spatial interactions to improve human motion prediction [9]. This consists in a neuro-symbolic model that has been proved to be effective in predicting human motion trajectories. The symbolic part is indeed a qualitative representation of spatial interactions between pairs of agents using the so-called Qualitative Trajectory Calculus (QTC) [10, 11]. QTC-based models of moving agent pairs can be described by different combinations of QTC symbols, which depends on properties of the interaction such as relative distance changes (i.e moving towards/away), velocity (i.e moving faster/slower), and orientation (i.e. moving to the left/right side).

However, deploying and validating these methods for context-aware human motion prediction on real-world robot domains has been only partly explored [5, 9, 7, 12, 6]. While offline experimental evaluation is essential for model comparison and selection, the actual performance of any chosen architecture can only be validated during the deployment phase, since real-world problems such as domain-shift and inference time can significantly affect the human prediction system. Based on the above considerations, and extending our previous NeuroSyM model for context-aware human motion prediction [9], the main contribution of this paper is two-fold:

  • a new ROS package for online human motion prediction and visualisation, called neuROSym, which is publicly available and includes the three blocks (inference, post-processing, and visualisation) depicted in Fig. 1;

  • a performance evaluation of the package on a real-world robot scenario with two different experimental settings: (i) people moving in the same directions, and (ii) people crossing each other’s paths.

The remainder of the paper is as follows: Sec. II presents an overview of the related works; Sec. III explains the approach adopted for human-context reasoning and deployment; Sec. IV illustrates and discusses the experimental results; finally, Sec. V concludes by summarising the main outcomes and suggesting future research directions.

II Related Work

Context-aware human motion prediction. Among the existing works in the area of context-aware human motion prediction [13], some incorporate spatio-temporal dependencies between interactions [14, 15, 16], while others consider spatial relations only [3, 4, 5, 6, 7, 8, 12, 17]. These can be further grouped in solutions that take into account static context [8], dynamic context [3, 5, 6, 15, 16], or both [14, 4, 7, 12, 17].

Two of the most popular architectures for human motion prediction are Social-LSTM (S-LSTM) [3] and Social Generative Adversarial Network (SGAN) [5]. Both use a spatially-aware pooling mechanism for incorporating the hidden states of nearby agents as a way to overcome the problem of variable and (potentially) large number of people in the scene. SGAN, however, is generally better than S-LSTM in terms of accuracy and time complexity, since it avoids grid-based pooling. SGAN features also lower time complexity and number of parameters compared to the Spatial-Temporal Graph Attention (STGAT) network [16]. Recently, we propsed a new neuro-symbolic approach [9] for context-aware human motion prediction, called NeuroSyM, which showed higher prediction accuracy on public datasets compared to SGAN. Other promising approaches have also demonstrated to improve networks performance, like the endpoint conditioned trajectories prediction in [6], the combined future activities and location prediction in [7], and the dynamic and static context-aware motion predictor in [14].

In this paper, we focus on the dynamic aspect of context, since it is typically the most challenging part for a mobile robot. We study in particular the real-world performance in human motion prediction of NeuroSyM against an SGAN baseline, implemented in a common ROS-based software framework for robot deployment.

Human-human interaction modeling: The methods to represent the interactions of nearby agents can be divided into one-to-one modeling and crowd modeling [18, 19, 20]. One-to-one interactions can be described by quantitative or qualitative representations. While the former have recently been modeled by multi-layer perceptrons embedding relative positions or velocities of agent pairs [1, 5, 3], qualitative approaches use symbolic representations, for example QTC-based [11, 21, 22]. Similar models were used in [23] to implement human-aware robot navigation strategies, where the prediction of interactions was limited to a Bayesian temporal model of single human-robot pairs, without taking into account nearby static or dynamic objects.

In our study, we consider one-to-one (i.e. pairwise) interactions through our previous NeuroSyM prediction system [9], which exploits QTC relations to weight the quantitative embedding of spatial interactions. However, we consider all the pairwise interactions in the neighbourhood, not just a single human-robot one.

Human motion prediction deployment. Most of the research on human motion prediction is evaluated on public datasets. In [24], the authors compared four different types of online human motion prediction for safe and efficient human-robot collaboration. Their models use linear regression and neural networks, with or without parameters adaptation. The authors showed that adaptable prediction models parameterized by neural networks achieved the best performance. They did not consider context-aware and long-term predictions though.

In [25], human motion prediction with Social Force Models (SFMs) was exploited for real-world people tracking. The work in [26] proposes a GAN-based solution that teaches the robot to mimic and predict human motion, but with a focus on actions rather than walking trajectories, and without taking into account context. In [27] instead, the authors validated a probabilistic framework, based on SFM and intention estimation, for human motion prediction with moving obstacles. While these solutions worked effectively on real-world systems, they rely significantly on the optimisation of their model parameters for each separate pair of interactions, and on the clustering methods to represent the observed scene.

In [28], the authors proposed real-time human motion prediction for robotics applications using physics-informed neural networks that embed SFM dynamic equations. Their model was trained on synthetic data and validated offline on one person only, without any domain-shift tests. While this latter is perhaps the closest to our current work, we deploy and evaluate both neural-only and neuro-symbolic approaches for context-aware human motion prediction considering multiple moving agents and walking patterns.

III neuROSym Architecture

In our previous work [9], we proposed a neuro-symbolic architecture for context-aware human motion prediction and validated its accuracy performance against baseline models in an offline setting, where public datasets were locally stored and used to train and test offline the models under investigation. It is well known, however, that the prediction performance may degrade when pre-trained models are transferred to real-world settings, especially due to domain-shift changes. Therefore, in this section, we present the neuROSym package for deployment and validation on real robots, which extends our previous work and provides an online evaluation tool, in terms of accuracy and runtime performance, for neural-only and neuro-symbolic prediction models.

III-A Main neuROSym Components

The new ROS package neuROSym, illustrated in Fig. 1, consists of the following three nodes:

  • Inference model node: it implements two subscribers to the same observational data topic whose messages are generated by a human tracker library. In parallel, it implements two publishers for the data visualisation and analytics node. Each pair subscriber-publisher corresponds to either ground truth or predicted samples. The node implements also the inference model for the prediction method under investigation. In this paper, we benchmark two state-of-the-art models: SGAN [5] and NeuroSyM [9]. We re-trained both models on two public datasets: Zara01 from the UCY dataset [29] and, for the first time, on the THOR222http://thor.oru.se/ dataset [30], where human motion patterns differ from the previous one.

  • Data visualisation and analytics node: this node runs in parallel to the inference node in order to generate, online, plots of the ground truth and predicted trajectories. It also generates average performance metrics, simultaneously to the visual plots.

  • Data post-processing node: this node is required to perform corrections in case the human tracking system misses some detections. If that happens, some people would be assigned different IDs over time. This node uses an offline ROS-Rviz visualiser to help matching different IDs of the same person in the scene.

III-B Inference Model

The inference model node of neuROSym is based on our previous work [9]. Here the generator part of both the NeuroSyM and the SGAN architectures consists of an encoder-pooling-decoder set of layers. In the following, we present briefly how the pooling mechanism of NeurSyM-SGAN (Fig. 2) incorporates the symbolic QTC knowledge in one of its layers. For a detailed explanation of the NeuroSyM architecture, including its performance on desktop experiments, we remand to the original paper [9].

Refer to caption
Figure 2: The NeuroSyM pooling mechanism with prior QTC knowledge injected into the output of the relative pose embedding layer.

Qualitative formulation of spatial interactions. A spatial interaction is represented by a vector of m𝑚mitalic_m QTC relations [10], which consist of qualitative symbols qi{,0,+}subscript𝑞𝑖0{q_{i}\in\{-,0,+\}}italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { - , 0 , + }, for i=1,,m𝑖1𝑚i=1,...,mitalic_i = 1 , … , italic_m. Among the different QTC versions, NeuroSyM adopts the double-cross QTCC1, since it better represents the dynamics of the agents in our application scenario. More specifically, NeuroSyM was tested in [9] with the four symbols {q1,q2,q3,q4}subscript𝑞1subscript𝑞2subscript𝑞3subscript𝑞4\{q_{1},\,q_{2},\,q_{3},\,q_{4}\}{ italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_q start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_q start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT , italic_q start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT }, where q1subscript𝑞1q_{1}italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and q2subscript𝑞2q_{2}italic_q start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT represent the relative motion between a pair of agents (moving towards or away from), while q3subscript𝑞3q_{3}italic_q start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT and q4subscript𝑞4q_{4}italic_q start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT represent the side relation (moving to the left or to the right of). An example of QTCC1 relations is shown in Fig. 3.

Refer to caption
Figure 3: An example of QTCC1 representation of interactions between three body points Phsubscript𝑃P_{h}italic_P start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT, Prsubscript𝑃𝑟P_{r}italic_P start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT, and Pssubscript𝑃𝑠P_{s}italic_P start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT.

Neuro-symbolic architecture. To label interactions, NeuroSyM exploits QTCC1 and the related concept of Conceptual Neighbourhood Diagram (CND) described in [10, 31]. The nodes of a CND are different QTC states, while edges represent the “closeness” of two QTC states at time t𝑡titalic_t and t+1𝑡1t+1italic_t + 1. In [9], we formulated the interaction label αCNDsubscript𝛼𝐶𝑁𝐷\alpha_{CND}italic_α start_POSTSUBSCRIPT italic_C italic_N italic_D end_POSTSUBSCRIPT for each QTC state as the likelihood of a transition in the CND as follows:

αCND=P(QTCt+1|QTCt)=1NTrsubscript𝛼𝐶𝑁𝐷𝑃conditionalsubscriptQTC𝑡1subscriptQTC𝑡1subscript𝑁𝑇𝑟\alpha_{CND}=P(\mathrm{QTC}_{t+1}|\mathrm{QTC}_{t})=\frac{1}{N_{Tr}}italic_α start_POSTSUBSCRIPT italic_C italic_N italic_D end_POSTSUBSCRIPT = italic_P ( roman_QTC start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT | roman_QTC start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_T italic_r end_POSTSUBSCRIPT end_ARG (1)

where NTrsubscript𝑁𝑇𝑟N_{Tr}italic_N start_POSTSUBSCRIPT italic_T italic_r end_POSTSUBSCRIPT is the number of possible transitions from the current state. In practice, αCNDsubscript𝛼𝐶𝑁𝐷\alpha_{CND}italic_α start_POSTSUBSCRIPT italic_C italic_N italic_D end_POSTSUBSCRIPT represents the level of stability, or reliability, of a QTC state. The higher the number of possible neighbour states, the lower the likelihood to transition into one of them. Given an interaction at time t𝑡titalic_t, we associate its label to the next one (observed or predicted) at t+1𝑡1t+1italic_t + 1. Typically, an interaction between agents A and B is calculated as an embedding of their relative pose as follows:

IAB=Dense(XBXA)subscript𝐼𝐴𝐵Densesubscript𝑋𝐵subscript𝑋𝐴I_{AB}=\emph{Dense}(X_{B}-X_{A})italic_I start_POSTSUBSCRIPT italic_A italic_B end_POSTSUBSCRIPT = Dense ( italic_X start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT - italic_X start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ) (2)

where Dense()𝐷𝑒𝑛𝑠𝑒Dense(\cdot)italic_D italic_e italic_n italic_s italic_e ( ⋅ ) is a fully connected layer. The symbolic processing transforms Eq. 2 into αCNDIABsubscript𝛼𝐶𝑁𝐷subscript𝐼𝐴𝐵\alpha_{CND}~{}I_{AB}italic_α start_POSTSUBSCRIPT italic_C italic_N italic_D end_POSTSUBSCRIPT italic_I start_POSTSUBSCRIPT italic_A italic_B end_POSTSUBSCRIPT. The QTC knowledge is domain-agnostic and therefore it can be applied to any neural network for time-series data modeling and prediction.

IV Experiments

IV-A Experimental Setup

We used a TIAGo333https://pal-robotics.com/robots/tiago/ mobile robot to monitor the motion of two people over a time period of 2 minutes. The robot was positioned at the corner of the experimental room (5m ×\times× 8.2m) and was equipped with a Velodyne VLP-16 3D LiDAR sensor, as shown in Fig. 4(a). This LiDAR features 16 scan channels, providing 360superscript360360^{\circ}360 start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT horizontal and 30superscript3030^{\circ}30 start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT vertical fields-of-view. The robot’s torso was set at the minimum height of approximately 1.21.21.21.2m to maximise the LiDAR’s chance of detecting nearby individuals. In order to track people in the scene, we run a Bayes People Tracker444https://github.com/LCAS/bayestracking [32] using point-cloud data from the LiDAR at a frequency of 10Hz. Fig. 4(c) shows an RViz screenshot with two humans tracked by the robot.

We conducted two types of experiments, illustrated in Figs. 4(a)-4(b). During these, we recorded the runtime of each inference model (SGAN baseline and NeuroSyM). We registered the rosbag file of each experiment (four in total) for offline processing. The system was running on a computer with 11th Gen Intel® Core™ i7-11800H processor and NVIDIA GeForce RTX 3080 16GB GPU. The two experiment settings are described next.

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Figure 4: (a) Experimental Scenario A with two humans walking parallel to each other towards their goal (room end) and back, repetitively. The online trajectory prediction is performed by models trained on the UCY-Zara01 dataset. (b) Experimental Scenario B with two humans crossing each other’s path. Here the models are trained on the THOR dataset. (c) RViz visualization of the Bayes People Tracker with human bounding-boxes extracted from the robot’s LiDAR point-clouds.

Scenario A: “all-forward” motion behaviour. Both SGAN and NeuroSyM were trained on the UCY-Zara01 pedestrians dataset [29]. The UCY dataset consists of real pedestrian trajectories with rich multi-human interactions captured at 2.5Hz. It includes three sequences (Zara01, Zara02, and UCY), taken in public spaces from top-view videos. The motion pattern of the pedestrians resembles the all-forward motion pattern replicated in our experiments (i.e. people walking in parallel directions) and illustrated in Fig. 4(a).

Scenario B: “cross-path” motion behaviour. The inference models were trained on the THOR dataset [30], which was recorded by the authors with a motion capture system at 100Hz. This indoor dataset contains motion interactions among people and their environment, including avoidance of static and dynamic obstacles (e.g. humans, robot, static objects) by people trying to reach their goal locations. Similar motion patterns were replicated in our cross-path scenario, as illustrated in Figs. 4(b) and 5.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 5: (Top) Full human motion trajectories in Scenario B, where two people H1subscript𝐻1H_{1}italic_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and H2subscript𝐻2H_{2}italic_H start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT (dynamic objects) move back and forth to their destinations (static objects), crossing each other’s path and avoiding collisions. (Bottom) Snapshots at frames t = 8, 10, and 16s, from left to right, respectively.

IV-B Data Processing

The ROS inference node processes the data sequentially, with an observed time window of 8 samples. Human trajectories affected by tracking errors (e.g. because of occlusions) were filtered out and not considered. The ROS inference node and the visualisation node run simultaneously, showing predicted and ground-truth trajectories at runtime. The performance comparison between the baseline SGAN model and the NeuroSyM architecture was conducted on the recorded rosbag files.

IV-C Results and Discussion

We evaluated the average accuracy of each inference model over the 2-minutes sessions of both experimental settings. The results are reported in Tables I and II, which include average displacement error (ADE) and final displacement error (FDE). These tables present 8 results in total, 4 for each scenario (A and B). We can see that, in all of the four cases, the higher accuracy achieved by NeuroSyM significantly reduced both ADE and FDE values compared to the SGAN baseline.

Scenario A rosbag 1 rosbag 2
SGAN NeuroSyM SGAN NeuroSyM
Avg. ADE (m) 12.4 7.06 16.32 2.52
Avg. FDE (m) 2.28 1.31 3.24 0.68
TABLE I: Accuracy evaluation for Scenario A, in terms of average displacement error (ADE) and final displacement error (FDE), over 2-minutes long experiments.
Scenario B rosbag 1 rosbag 2
SGAN NeuroSyM SGAN NeuroSyM
Avg. ADE (m) 10.88 5.7 24.27 9.87
Avg. FDE (m) 2.67 1.4 5 1.8
TABLE II: Accuracy evaluation for Scenario B, in terms of average displacement error (ADE) and final displacement error (FDE), over 2-minutes long experiments.

Fig. 6 illustrates some examples of ground truth and predicted trajectories in Scenarios A (top plot) and B (bottom plot), with the corresponding ADE and FDE metrics. We can clearly see that ADE and FDE are lower, in both plots, where the NeuroSyM model was used.

Refer to caption
Refer to caption
Figure 6: Examples of trajectories of two people for a single sequence, extracted from the total number of sequences generated within a 2-minutes experiment. The solid lines are the ground-truth trajectories of the predictions (8 samples each), while the dotted and the dash-dotted lines are the neuro-symbolic and the baseline predictions, respectively. The small circle is the starting point. (Top) Experimental scenario A. (Bottom) Experimental scenario B. Superscripts sym and b denote neuro-symbolic and baseline, respectively.

We also evaluated the average runtime of each inference model in both experimental scenarios. The results are reported in Table III, showing that NeuroSyM model is slightly slower than, but still comparable to, the SGAN baseline.

Scenario SGAN NeuroSyM
Average time (s) A 4.17 5.37
B 5.19 7.36
TABLE III: Runtime evaluation for the two scenarios.

From Tables III, and III, we can conclude that, although the NeuroSyM architecture requires more time to predict human trajectories compared to the SGAN baseline, it is still relatively fast and, with some code optimisation, suitable for real-time deployment. In particular, the trade-off between runtime and accuracy is clearly in favour of the NeuroSyM solution, since its QTC-based context-awareness enables more accurate motion predictions.

V Conclusion

In this work, we implemented and deployed a ROS-based architecture, called neuROSym, for neural-only and neuro-symbolic motion prediction on real-world robotic systems. Using this framework, we experimentally evaluated the accuracy and runtime performance of two predictions models, SGAN and NeuroSyM, during online inference in two scenarios with different human motion patterns. The results show a trade-off between accuracy and runtime performance in favor of the NeuroSyM solution, which is particularly suitable for human-aware robot navigation. Our future work will extend the evaluation of neuROSym to more diverse and challenging scenarios, including complex human motion patterns with multiple people. We will also consider more robust people trackers and test against different baseline architectures with static- and dynamic-context awareness.

References

  • [1] C. Chen, Y. Liu, S. Kreiss, and A. Alahi, “Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning,” in IEEE Int. Conf. on Robotics and Automation (ICRA), 2019.
  • [2] A. Pokle, R. Martín-Martín, P. Goebel, V. Chow, H. M. Ewald, J. Yang, Z. Wang, A. Sadeghian, D. Sadigh, S. Savarese, et al., “Deep local trajectory replanning and control for robot navigation,” in IEEE Int. Conf. on Robotics and Automation (ICRA), 2019.
  • [3] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [4] M. Lisotto, P. Coscia, and L. Ballan, “Social and scene-aware trajectory prediction in crowded spaces,” in Proc. of the IEEE/CVF Int. Conf. on Computer Vision Workshops, 2019, pp. 0–0.
  • [5] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, “Social gan: Socially acceptable trajectories with generative adversarial networks,” in IEEE conf. on computer vision and pattern recognition (CVPR), 2018.
  • [6] K. Mangalam, H. Girase, S. Agarwal, K.-H. Lee, E. Adeli, J. Malik, and A. Gaidon, “It is not the journey but the destination: Endpoint conditioned trajectory prediction,” in European Conf. on Computer Vision (ECCV, 2020, pp. 759–776.
  • [7] J. Liang, L. Jiang, J. C. Niebles, A. G. Hauptmann, and L. Fei-Fei, “Peeking into the future: Predicting future person activities and locations in videos,” in IEEE/CVF conf. on computer vision and pattern recognition (CVPR, 2019.
  • [8] Z. Cao, H. Gao, K. Mangalam, Q.-Z. Cai, M. Vo, and J. Malik, “Long-term human motion prediction with scene context,” in European Conf. on Computer Vision (ECCV), 2020, pp. 387–404.
  • [9] S. Mghames, L. Castri, M. Hanheide, and N. Bellotto, “A neuro-symbolic approach for enhanced human motion prediction,” in Int. Joint Conf. on Neural Networks (IJCNN), 2023, pp. 1–8.
  • [10] M. Delafontaine, S. H. Chavoshi, A. G. Cohn, and N. Van de Weghe, “A qualitative trajectory calculus to reason about moving point objects,” in Qualitative spatio-temporal representation and reasoning: Trends and future directions.   IGI Global, 2012.
  • [11] N. Bellotto, M. Hanheide, and N. Van de Weghe, “Qualitative design and implementation of human-robot spatial interactions,” in Int. Conf. on Social Robotics (ICSR), 2013.
  • [12] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker, “Desire: Distant future prediction in dynamic scenes with interacting agents,” in Proc. of the IEEE Conf. on computer vision and pattern recognition (CVPR), 2017, pp. 336–345.
  • [13] A. Rudenko, L. Palmieri, M. Herman, K. M. Kitani, D. M. Gavrila, and K. O. Arras, “Human motion trajectory prediction: A survey,” The Int. Journal of Robotics Research, vol. 39, no. 8, pp. 895–935, 2020.
  • [14] C. Tao, Q. Jiang, L. Duan, and P. Luo, “Dynamic and static context-aware lstm for multi-agent motion prediction,” in European Conf. on Computer Vision (ECCV), 2020.
  • [15] C. Yu, X. Ma, J. Ren, H. Zhao, and S. Yi, “Spatio-temporal graph transformer networks for pedestrian trajectory prediction,” in European Conf. on Computer Vision, 2020, pp. 507–523.
  • [16] Y. Huang, H. Bi, Z. Li, T. Mao, and Z. Wang, “Stgat: Modeling spatial-temporal interactions for human trajectory prediction,” in the IEEE/CVF Int. conf. on computer vision, 2019.
  • [17] N. Bisagno, C. Saltori, B. Zhang, F. G. De Natale, and N. Conci, “Embedding group and obstacle information in lstm networks for human trajectory prediction in crowded scenes,” Computer Vision and Image Understanding, 2021.
  • [18] K. Ijaz, S. Sohail, and S. Hashish, “A survey of latest approaches for crowd simulation and modeling using hybrid techniques,” in Proc. of the 17th UKSIM-AMSS Int. Conf. on Modelling and Simulation, 2015.
  • [19] H. Hedayati, A. Muehlbradt, D. J. Szafir, and S. Andrist, “Reform: Recognizing f-formations for social robots,” in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2020.
  • [20] S. Thompson, A. Gupta, A. W. Gupta, A. Chen, and M. Vázquez, “Conversational group detection with graph neural networks,” in Proc. of the Int. Conf. on Multimodal Interaction, 2021.
  • [21] M. Hanheide, A. Peters, and N. Bellotto, “Analysis of human-robot spatial behaviour applying a qualitative trajectory calculus,” in IEEE Int. Symposium on Robot and Human Interactive Communication (RO-MAN), 2012.
  • [22] C. Dondrup, N. Bellotto, and M. Hanheide, “A probabilistic model of human-robot spatial interaction using a qualitative trajectory calculus,” in AAAI Spring Symposium Series, 2014.
  • [23] C. Dondrup and M. Hanheide, “Qualitative constraints for human-aware robot navigation using velocity costmaps,” in IEEE Int. Symposium on Robot and Human Interactive Communication (RO-MAN), 2016.
  • [24] W. Zhao, L. Sun, C. Liu, and M. Tomizuka, “Experimental evaluation of human motion prediction toward safe and efficient human robot collaboration,” in American Control Conf. (ACC), 2020, pp. 4349–4354.
  • [25] M. Luber, J. A. Stork, G. D. Tipaldi, and K. O. Arras, “People tracking with human motion predictions from social forces,” in IEEE Int. Conf. on Robotics and Automation (ICRA), 2010, pp. 464–469.
  • [26] L.-Y. Gui, K. Zhang, Y.-X. Wang, X. Liang, J. M. Moura, and M. Veloso, “Teaching robots to predict human motion,” in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2018, pp. 562–567.
  • [27] G. Ferrer and A. Sanfeliu, “Behavior estimation for a complete framework for human motion prediction in crowded environments,” in IEEE Int. Conf. on Robotics and Automation (ICRA), 2014, pp. 5940–5945.
  • [28] A. Antonucci, G. P. R. Papini, P. Bevilacqua, L. Palopoli, and D. Fontanelli, “Efficient prediction of human motion for real-time robotics applications with physics-inspired neural networks,” IEEE Access, vol. 10, pp. 144–157, 2021.
  • [29] A. Lerner, Y. Chrysanthou, and D. Lischinski, “Crowds by example,” in Computer graphics forum, vol. 26, no. 3, 2007, pp. 655–664.
  • [30] A. Rudenko, T. P. Kucner, C. S. Swaminathan, R. T. Chadalavada, K. O. Arras, and A. J. Lilienthal, “Thör: Human-robot navigation data collection and accurate motion trajectories dataset,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 676–682, 2020.
  • [31] N. Van de Weghe and P. De Maeyer, “Conceptual neighbourhood diagrams for representing moving objects,” in Perspectives in Conceptual Modeling, 2005, pp. 228–238.
  • [32] Z. Yan, T. Duckett, and N. Bellotto, “Online learning for human classification in 3d lidar-based tracking,” in IEEE/RSJ Int. Conf. on Intell. Robots & Systems (IROS), 2017, pp. 864–871.