This letter proposes R-VoxelMap, a novel voxel mapping method that constructs accurate voxel maps using a geometry-driven recursive plane fitting strategy to enhance the localization accuracy of online LiDAR odometry. VoxelMap and its variants typically fit and check planes using all points in a voxel, which may lead to plane parameter deviation caused by outliers, over-segmentation of large planes, and incorrect merging across different physical planes. To address these issues, R-VoxelMap utilizes a geometry-driven recursive construction strategy based on an outlier detect-and-reuse pipeline. Specifically, for each voxel, accurate planes are first fitted while separating outliers using random sample consensus (RANSAC). The remaining outliers are then propagated to deeper octree levels for recursive processing, ensuring a detailed representation of the environment. In addition, a point distribution-based validity check algorithm is devised to prevent erroneous plane merging. Extensive experiments on diverse open-source LiDAR(-inertial) simultaneous localization and mapping (SLAM) datasets validate that our method achieves higher accuracy than other state-of-the-art approaches, with comparable efficiency and memory usage.
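To make the fit-and-propagate idea concrete, the sketch below shows one plausible reading of the recursive strategy: RANSAC fits a plane per voxel, and the outliers are pushed into the eight child octants for further fitting. The thresholds, the fixed octree depth, and the function names are illustrative assumptions rather than R-VoxelMap's actual parameters or implementation, and the point distribution-based validity check is omitted.

```python
import numpy as np

def ransac_plane(points, thresh=0.05, iters=100, min_inliers=10, rng=None):
    """Fit one plane with RANSAC; return ((normal, d), inlier_mask) or None."""
    rng = rng or np.random.default_rng()
    best_plane, best_mask, best_count = None, None, 0
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                                  # degenerate (collinear) sample
        n /= norm
        d = -n.dot(sample[0])
        mask = np.abs(points @ n + d) < thresh        # point-to-plane distances
        if mask.sum() > best_count:
            best_plane, best_mask, best_count = (n, d), mask, mask.sum()
    if best_count < min_inliers:
        return None
    return best_plane, best_mask

def build_voxel(points, center, size, depth=0, max_depth=3, min_pts=20):
    """Fit a plane in this voxel, then push RANSAC outliers into the 8 child octants."""
    planes = []
    if len(points) < min_pts or depth > max_depth:
        return planes
    fit = ransac_plane(points)
    if fit is None:
        outliers = points                             # nothing planar found at this level
    else:
        plane, inliers = fit
        planes.append(plane)
        outliers = points[~inliers]
    if depth < max_depth and len(outliers) >= min_pts:
        half = size / 2.0
        for i in (-1, 1):
            for j in (-1, 1):
                for k in (-1, 1):                     # recurse into the eight octants
                    child_center = center + np.array([i, j, k]) * half / 2.0
                    in_child = np.all(np.abs(outliers - child_center) <= half / 2.0, axis=1)
                    planes += build_voxel(outliers[in_child], child_center, half,
                                          depth + 1, max_depth, min_pts)
    return planes
```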
In the field of robotic manipulation, traditional methods lack the flexibility required to meet the demands of diverse applications. Consequently, researchers have increasingly focused on developing more general techniques, particularly for long-horizon and gentle manipulation, to enhance the manipulation ability and adaptability of robots. In this study, we propose a framework called VLM-Driven Atomic Skills with Diffusion Policy Distillation (VASK-DP), which integrates tactile sensing to enable gentle control of robotic arms in long-horizon tasks. The framework trains atomic manipulation skills through reinforcement learning in simulated environments. The Visual Language Model (VLM) interprets RGB observations and natural language instructions to select and sequence atomic skills, guiding task decomposition, skill switching, and execution. It also generates expert demonstration datasets that serve as the basis for imitation learning. Subsequently, compliant long-horizon manipulation policies are distilled from these demonstrations using diffusion-based imitation learning. We evaluate multiple control modes, distillation strategies, and decision frameworks. Quantitative results across diverse simulation environments and long-horizon tasks validate the effectiveness of our approach. Furthermore, real robot deployment demonstrates successful task execution on physical hardware.
In range-based SLAM systems, localization accuracy depends on the quality of geometric maps. Sparse LiDAR scans and noisy depth from RGB-D sensors often yield incomplete or inaccurate reconstructions that degrade pose estimation. Appearance and semantic cues, readily available from onboard RGB and pretrained models, can serve as complementary signals to strengthen geometry. Nevertheless, variations in appearance due to illumination or texture and inconsistencies in semantic labels across frames can hinder geometric optimization if directly used as supervision. To address these challenges, we propose AING-SLAM, an Accurate Implicit Neural Geometry-aware SLAM framework that allows appearance and semantics to effectively strengthen geometry in both mapping and odometry. A unified neural point representation with a lightweight cross-modal decoder integrates geometry, appearance and semantics, enabling auxiliary cues to refine geometry even in sparse or ambiguous regions. For pose tracking, appearance-semantic-aided odometry jointly minimizes SDF, appearance, and semantic residuals with adaptive weighting, improving scan-to-map alignment and reducing drift. To safeguard stability, a history-guided gradient fusion strategy aligns instantaneous updates with long-term optimization trends, mitigating occasional inconsistencies between appearance/semantic cues and SDF-based supervision, thereby strengthening geometric optimization. Extensive experiments on indoor RGB-D and outdoor LiDAR benchmarks demonstrate real-time performance, state-of-the-art localization accuracy, and high-fidelity reconstruction across diverse environments.
LiDAR-Inertial Odometry (LIO) is a foundational technique for autonomous systems, yet its deployment on resource-constrained platforms remains challenging due to computational and memory limitations. We propose Super-LIO, a robust LIO system designed for applications that demand both high performance and accuracy, such as aerial robots and mobile autonomous systems. At the core of Super-LIO is a compact octo-voxel-based map structure, termed OctVox, that limits each voxel to eight subvoxel representatives, enabling strict point density control and incremental denoising during map updates. This design yields a simple yet efficient and accurate map structure, which can be easily integrated into existing LIO frameworks. Additionally, Super-LIO introduces a heuristic-guided KNN strategy (HKNN) that accelerates the correspondence search by leveraging spatial locality, further reducing runtime overhead. We evaluated the proposed system using four publicly available datasets and several self-collected datasets, totaling more than 30 sequences. Extensive testing on both X86 and ARM platforms confirms that Super-LIO offers superior efficiency and robustness while maintaining competitive accuracy. Super-LIO processes each frame approximately 73% faster than state-of-the-art systems while consuming fewer CPU resources. The system is fully open-source and compatible with a wide range of LiDAR sensors and computing platforms. The implementation is available at: https://github.com/Liansheng-Wang/Super-LIO.git.
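A minimal sketch of the OctVox idea as described above: a hash map of voxels, each holding at most eight sub-voxel representatives maintained as running means, which caps point density and averages out noise as the map is updated. The class name, voxel size, and running-mean update are illustrative assumptions, not Super-LIO's actual data structure.

```python
import numpy as np
from collections import defaultdict

class OctVoxMap:
    """Each voxel keeps at most eight sub-voxel representatives (one per 2x2x2 octant)."""

    def __init__(self, voxel_size=0.5):
        self.voxel_size = voxel_size
        # voxel key -> {sub-voxel index (0..7): (mean point, count)}
        self.voxels = defaultdict(dict)

    def insert(self, points):
        for p in np.asarray(points, dtype=float):
            v = np.floor(p / self.voxel_size).astype(int)
            local = p / self.voxel_size - v            # position inside the voxel, in [0, 1)
            octant = (int(local[0] >= 0.5)
                      | (int(local[1] >= 0.5) << 1)
                      | (int(local[2] >= 0.5) << 2))   # 0..7 sub-voxel index
            cell = self.voxels[tuple(v)]
            if octant in cell:
                mean, n = cell[octant]
                cell[octant] = ((mean * n + p) / (n + 1), n + 1)  # running mean = denoising
            else:
                cell[octant] = (p, 1)

    def points(self):
        """Return all representatives (at most eight per voxel)."""
        return np.array([rep for cell in self.voxels.values()
                         for rep, _ in cell.values()])
```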
Drone technology is proliferating in many industries, including agriculture, logistics, defense, infrastructure, and environmental monitoring. Vision-based autonomy is one of its key enablers, particularly for real-world applications, and is essential for operating in novel, unstructured environments where traditional navigation methods may be unavailable. Autonomous drone racing has become the de facto benchmark for such systems. State-of-the-art research has shown that autonomous systems can surpass human-level performance in racing arenas. However, their direct applicability to commercial and field operations is still limited, as current systems are often trained and evaluated in highly controlled environments. In our contribution, the system's capabilities are not only analyzed within a controlled environment, where external tracking is available for ground-truth comparison, but also demonstrated in a challenging, uninstrumented environment, where ground-truth measurements are unavailable. We show that our approach can match the performance of professional human pilots in both scenarios.
The divergent component of motion (DCM) quantifies the unstable component of a robot's motion and is widely used for online motion planning and stabilization in robots. Although the DCM is derived from the linear inverted pendulum model, which admits a closed-form solution, extending it to nonlinear, height-varying pendulum dynamics complicates the derivation and typically requires iterative computation or piecewise linearization, resulting in a high computational cost. This letter introduces an energy-based formulation that yields the DCM in closed form even when the pendulum height varies, allowing for the flexible generation of curved paths that include both convex and concave shapes without requiring piecewise linearization. The proposed method is validated through comparative pendulum simulations and through its application to bipedal walking and a balance-assist motorcycle. The results confirm that the proposed method successfully enables the DCM to be expressed for nonlinear trajectories, allowing a conventional DCM-based stabilizer to be applied directly.
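For background, the DCM in the standard constant-height linear inverted pendulum case (the letter's energy-based closed form for height-varying pendula is not reproduced here) is

$$\xi = x + \frac{\dot{x}}{\omega}, \qquad \omega = \sqrt{\frac{g}{z_0}}, \qquad \dot{\xi} = \omega\,(\xi - p),$$

where $x$ is the center of mass, $z_0$ the constant pendulum height, and $p$ the center of pressure; the unstable first-order dynamics of $\xi$ is what a DCM-based stabilizer regulates. The proposed derivation generalizes this closed-form expression to nonlinear, height-varying pendulum dynamics.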
Accurate yet low-latency depth is essential for radar-camera perception in autonomous systems. Cameras provide rich appearance but lack metric scale, whereas automotive radar offers metric range but is sparse and noisy. Many pipelines are multi-stage or depend on auxiliary annotations, increasing latency and limiting portability. We introduce JustDepth, a single-stage radar-camera depth estimator trained only with radar, camera, and single-scan LiDAR. All radar returns are aggregated into a fixed-width 1D representation, decoupling runtime from point count. A Height Fusion Block fuses modalities, a lightweight GNN propagates depth globally, and a training-only confidence decoder stabilizes learning with zero test-time cost. We mitigate stripe artifacts via simple augmentations and quantify them using the Vertical-Horizontal Gradient Ratio (VHGR). On nuScenes, compared to recent state-of-the-art methods, JustDepth maintains accuracy while reducing inference time by $39.7\times$ and stripe artifacts by 66% as measured by VHGR.
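The fixed-width 1D aggregation can be pictured with the following sketch, which bins radar returns by azimuth and keeps the nearest range per bin so that runtime is independent of the number of returns. The bin count, field of view, and nearest-range reduction are illustrative assumptions, not JustDepth's actual representation.

```python
import numpy as np

def radar_to_1d(points, num_bins=192, max_range=80.0, fov_deg=90.0):
    """Aggregate radar returns into a fixed-width 1D array, independent of point count.

    points: (N, >=2) array with x forward and y left. Each return falls into an
    azimuth bin; per bin we keep the nearest (normalized) range.
    """
    feat = np.zeros(num_bins)
    az = np.degrees(np.arctan2(points[:, 1], points[:, 0]))      # azimuth per return
    rng = np.linalg.norm(points[:, :2], axis=1)
    bins = ((az + fov_deg / 2) / fov_deg * num_bins).astype(int)
    valid = (bins >= 0) & (bins < num_bins) & (rng < max_range)
    for b, r in zip(bins[valid], rng[valid]):
        norm_r = 1.0 - r / max_range                             # closer return -> larger value
        feat[b] = max(feat[b], norm_r)                           # keep the nearest return per bin
    return feat
```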
Control Barrier Functions (CBFs) provide a powerful framework for enforcing real-time safety in control systems and have seen increasing applications in safety-critical domains, such as surgical robotics. In vitreoretinal microsurgery, where precision and tissue protection are crucial, we propose a shared control approach that leverages CBFs to maintain the robot's end-effector within a safe zone above the retina. Using real-time 3D reconstruction from an instrument-integrated Optical Coherence Tomography (iiOCT) system mounted on the surgical tool, we define a safety band between two offset surfaces derived from the reconstructed retina. A hybrid controller drives the tool into the band when outside and then enforces forward invariance using a CBF-based quadratic program. Concurrently, haptic feedback proportional to the deviation from the band centre guides the surgeon toward the optimal working distance. We validate our method in ex vivo pig eye experiments, performing a simulated Vitreous Shaving (VS), showing improved safety and operator awareness.
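For reference, the generic CBF-based quadratic program that enforces forward invariance of a safe set $\{x : h(x) \ge 0\}$ for control-affine dynamics $\dot{x} = f(x) + g(x)u$ has the standard form

$$u^{*} = \arg\min_{u} \|u - u_{\mathrm{des}}\|^{2} \quad \text{s.t.} \quad \nabla h(x)^{\top}\bigl(f(x) + g(x)u\bigr) \ge -\alpha\bigl(h(x)\bigr),$$

where $\alpha$ is a class-$\mathcal{K}$ function. In the proposed system, $h$ is derived from the iiOCT-reconstructed safety band between the two offset surfaces, and the hybrid controller applies this filter once the tool is inside the band.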
Detecting incipient slip enables early intervention to prevent object slippage and enhance robotic manipulation safety. Achieving this requires tactile interfaces with compliant, high-sensitivity skins that amplify local deformation differences and tactile sensing pipelines that preserve and exploit high-resolution spatio-temporal signals. This work presents an event-based tactile sensing system for incipient slip detection, built on the NeuroTac sensor, featuring an extruding papillae-based skin and a spiking convolutional neural network (SCNN). The SCNN achieves 94.33% accuracy in three-class classification (no slip, incipient slip, and gross slip) during slip conditions induced by sensor motion, delivering slightly higher accuracy than its artificial neural network (ANN) counterparts while maintaining a low theoretical computational cost. Under dynamic gravity-induced slip validation settings, after temporal smoothing of the SCNN’s final-layer spike counts, the system detects incipient slip 360–1840 ms before actual gross slip across all trials. These results demonstrate the effectiveness of the proposed system and highlight its potential for future neuromorphic hardware deployment.
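A minimal sketch of the detection stage after the SCNN, assuming the final-layer spike counts per class are smoothed with a moving average and incipient slip is flagged when its smoothed share crosses a threshold; the window length, threshold, and class indexing are assumptions rather than the paper's exact post-processing.

```python
import numpy as np

def detect_incipient_slip(spike_counts, window=5, threshold=0.5):
    """Temporally smooth final-layer spike counts and flag the first incipient-slip bin.

    spike_counts: (T, 3) spike counts per time bin for (no slip, incipient slip, gross slip).
    Returns the index of the first flagged bin (-1 if none) and the smoothed class shares.
    """
    kernel = np.ones(window) / window
    smoothed = np.vstack([np.convolve(spike_counts[:, c], kernel, mode="same")
                          for c in range(spike_counts.shape[1])]).T
    shares = smoothed / np.clip(smoothed.sum(axis=1, keepdims=True), 1e-9, None)
    flagged = shares[:, 1] > threshold          # class index 1 = incipient slip
    first = int(np.argmax(flagged)) if flagged.any() else -1
    return first, shares
```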
Digital twin applications offer transformative potential by enabling real-time monitoring and robotic simulation through accurate virtual replicas of physical assets. The key to these systems is 3D reconstruction with high geometric fidelity. However, existing methods struggle under field conditions, especially with sparse and occluded views. This study develops the DATR framework for apple tree reconstruction from sparse views. DATR automatically generates tree masks from field images using onboard sensors and foundation models, which filter background information from multi-modal data for reconstruction. Leveraging this multi-modality, a diffusion model generates novel views and a large reconstruction model produces implicit neural fields. Geometric priors enable scale retrieval to align the output with real-world dimensions. Both models were trained on realistic synthetic apple trees generated by a Real2Sim data generator. The framework was evaluated on both field and synthetic datasets. The field dataset includes six apple trees with field-measured ground truth, while the synthetic dataset features structurally diverse trees. The evaluation results indicate that DATR surpasses existing 3D reconstruction techniques on both datasets, providing a favorable balance of accuracy and efficiency. It achieves domain-trait estimation performance on par with industrial-grade stationary laser scanners while increasing scan throughput by approximately 1000×, underscoring its strong promise for scalable agricultural digital twin systems.
Zero-shot object navigation (ZSON) in unseen environments poses a significant challenge due to the absence of object-specific priors and the need for efficient exploration. Existing approaches often struggle with ineffective search strategies and repeated visits to irrelevant areas. In this letter, we introduce a curiosity-driven framework that leverages the commonsense reasoning capabilities of vision-language models (VLMs) to guide exploration. At each step, the agent estimates the semantic plausibility of regions based on language-conditioned visual cues, constructing a dynamic value map that promotes informative regions and suppresses redundancy. The core contribution of this work is integrating VLM-based scene understanding into the curiosity mechanism, enabling the agent to make human-like judgments about environmental relevance during navigation. Extensive experiments on the HM3D benchmark show that our method achieves a 12.1% absolute improvement in Success Rate (SR) over strong baselines (from 56.5% to 68.6%). Qualitative analysis further confirms that the proposed strategy leads to more efficient and goal-directed exploration.
Robust and efficient point cloud registration is fundamental to robotic perception tasks, such as SLAM, 3D reconstruction, and autonomous navigation. Traditional methods often rely on fragile point correspondences and iterative optimization, struggling under conditions like poor initialization, partial overlap, and perceptual aliasing. We propose Complex Kernel Registration (CKR), a novel framework that replaces fragile point-level correspondences with a robust alignment of high-level block descriptors. CKR operates by establishing coarse matches between local block descriptors in a high-dimensional complex Hilbert space. CKR extracts compact 8D descriptors capturing both geometric structure and LiDAR reflectance, embeds them via Random Fourier Features, and estimates rigid transformations using Wirtinger gradient descent. This formulation bypasses explicit point matching, yielding fast, accurate, and robust registration, even in cluttered, occluded, and low-texture environments. Extensive experiments on real-world LiDAR datasets demonstrate that CKR outperforms classical and learning-based baselines in accuracy, robustness, and computational efficiency, making it practical for deployment on resource-constrained robotic platforms without specialized hardware.
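The embedding step can be illustrated with the standard real-valued random Fourier feature map for a Gaussian kernel. The paper's actual embedding lives in a complex Hilbert space and the alignment is optimized with Wirtinger gradient descent, neither of which is reproduced here; the feature dimension and bandwidth below are arbitrary.

```python
import numpy as np

def rff_embed(descriptors, dim=256, sigma=1.0, seed=0):
    """Map 8-D block descriptors into a random Fourier feature space.

    Standard real-valued RFF approximation of a Gaussian kernel, shown only to
    illustrate the embedding idea behind CKR's kernel-space alignment.
    """
    rng = np.random.default_rng(seed)
    d = descriptors.shape[1]                            # here d = 8
    W = rng.normal(scale=1.0 / sigma, size=(d, dim))    # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=dim)         # random phases
    return np.sqrt(2.0 / dim) * np.cos(descriptors @ W + b)

# For a Gaussian kernel with bandwidth sigma: k(x, y) ~ rff_embed(x) @ rff_embed(y).T
```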
Cameras and radar sensors are complementary in 3D object detection in that cameras specialize in capturing an object's visual information while radar provides spatial information and velocity hints. Existing radar-camera fusion methods often employ a symmetrical architecture that processes inputs from cameras and radar indiscriminately, preventing full exploitation of each modality's distinct advantages. To this end, we propose RaCFusion, a radar-camera fusion framework that leverages the camera stream as the main detector and improves it via radar-assisted hierarchical refinement. Technically, the radar-assisted refinement is performed via two specifically designed modules. First, in the Radar-assisted Query Generation module, the initial object queries of the image branch are augmented with the spatial information obtained from radar data, formulating enhanced hybrid object queries. These hybrid object queries interact with the image features in the transformer decoder to generate object-centric query features. Subsequently, within the Radar-assisted Velocity Aggregation module, these query features undergo further refinement through the incorporation of radar-assisted velocity features. These velocity features are learned from the nuanced relationships between the queries and radar features, thereby diminishing the error in velocity estimation by utilizing the valuable velocity clues from the radar sensor. RaCFusion achieves competitive performance among radar-camera fusion 3D detectors on the nuScenes benchmark.
Deformable objects (DOs) are prevalent in everyday environments and represent important targets for robotic manipulation. However, their high degrees of freedom and complex nonlinear deformations make them more challenging to model and control than rigid objects when relying on traditional analytical approaches. To address this, we propose a data-driven method to model the dynamics of deformable objects. Our method utilizes time-series data to predict future states without relying on complex dynamics models. We employ model predictive control (MPC) for robot manipulation and improve its performance through online updates of the data-driven model. To handle cables with varying configurations, interpolation is applied to align model input structures. In this study, we focus on manipulating deformable linear objects (DLOs) with different mechanical properties and configurations using a dual-arm robotic system, both in simulation and in real-world environments.
The development of orthopedic robots for epiphyseal surgery is still in its nascent stages, with little consideration given to autonomous path planning and biological forces. The objective of this study is to develop an autonomous grinding path planning algorithm and to design a trajectory optimization algorithm that reduces the biological force during grinding, thereby enhancing the safety of the surgery. First, a point-cloud-based layered grinding path planning method was constructed from CT images of the bone bridge. Subsequently, by introducing grinding force constraints and combining them with grinding time and smoothness, the problem is modeled as a multi-objective optimization problem, providing diverse and high-quality solutions for robotic bone grinding. Finally, the NSGA-II multi-objective genetic algorithm is improved to enhance the diversity and convergence of the optimized grinding trajectory solutions. To verify the effectiveness of the proposed method, robotic bone grinding experiments were conducted. The Dice scores of the bone bridge grinding results exceeded 90%, ensuring removal of the bone bridge.
Freespace detection in autonomous driving is limited by the lack of explicit geometric modeling, hindering generalization across complex terrains. Existing approaches are predominantly data-driven and neglect the physical structure of drivable surfaces. We propose Terrain Flat (TerrFlat), a physics-driven geometric representation that models road surfaces along three interpretable dimensions: lateral smoothness, longitudinal consistency, and vertical deviation. TerrFlat is constructed through geometric reasoning and projected into pixel-aligned maps via a differentiable projection, ensuring geometric–visual consistency. Building on this representation, we introduce a symmetric feature fusion module (SFFM) to integrate TerrFlat with visual features through bidirectional recalibration, improving semantic discrimination and boundary localization. Together, TerrFlat and SFFM form TerrFlat-Seg, a unified framework for physics-aware freespace perception. Experiments on KITTI-Road, Semantic-KITTI, and the off-road ORFD datasets demonstrate consistent improvements over existing baselines. Real-world validation on an automated guided vehicle platform further confirms the robustness of our approach.
Autonomous vehicles must recognize their surroundings and distinguish between dynamic and static objects to avoid collisions. Most recent moving object segmentation (MOS) studies project LiDAR point-cloud streams into multiple views to capture spatio-temporal cues. When a single view proves insufficient, the common strategy is to combine or convert views. However, existing methods rebuild features in full 3D space during conversion, significantly increasing time and memory costs. They also depend on high-end graphics processing units, which limits their on-board deployment. We introduce SwiftMOS, a lightweight framework centered on a direct view transformation module that rapidly converts bird's-eye view to range view without restoring 3D features. Direct view transformation leverages precomputed grids and view-specific hint maps to efficiently map coordinates, and an addition-fusion feature flow quickly merges past and current frames to capture the spatio-temporal context. Extensive quantitative and qualitative experiments on the SemanticKITTI dataset confirm the validity and real-time capability of the proposed method. SwiftMOS achieves a substantial latency reduction while maintaining competitive performance. It operates within the typical LiDAR scan rate on on-board hardware and also achieves robust performance on the Sipailou Campus Dataset, underscoring its practicality for real-world autonomous driving.
Soft robots excel in adaptability, and rigid robots in precise, forceful actuation, yet each faces limitations that reduce performance in cluttered or constrained settings. We address these gaps by enabling physical coupling between a soft Vine robot and centimeter-scale RESCUE roller robots, allowing heterogeneous teams to work cooperatively and leverage each other's strengths. Our approach improves mobility and enables power sharing and communication through mechanical interfaces, modular chain formation, and high-speed recharging and data transfer via electrical contact using magnetic connectors. RESCUE rollers can form chains for collaborative locomotion and load distribution, and seamlessly integrate with the Vine robot to extend its workspace. In return, the Vine robot serves as a mobile power and communication hub, recharging rollers and relaying data to extend operational time. Experimental validation includes measurements of pushing and pulling strength in roller chains (Section III-A), Vine robot buckling under load (Section III-B), and demonstrations of Vine robot and RESCUE roller interactions in a cluttered environment (Section IV-A), tight turning of the Vine robot using rollers (Section IV-B), and in-field recharging of RESCUE rollers (Section IV-C). This work advances cooperative strategies for heterogeneous robotic systems, enabling adaptable and resilient performance in challenging environments.
In many robotics applications, it is necessary to compute not only the distance between the robot and the environment, but also its derivative - for example, when using control barrier functions. However, since the traditional Euclidean distance is not guaranteed to be differentiable everywhere, there is a need for alternative distance metrics that possess this property. Recently, a metric with guaranteed differentiability was proposed [1]. This approach has some important drawbacks, which we address in this letter. We provide much simpler and more practical expressions for the smooth projection onto general convex polytopes. Additionally, as opposed to [1], we ensure that the distance vanishes as the objects overlap. We show the efficacy of the approach in experimental results. Our proposed distance metric is publicly available through the Python-based simulation package UAIBot.
Robotic surgical systems rely on visual input to perform critical tasks such as tool manipulation and gesture recognition, where image quality directly affects performance. However, conventional image quality assessment (IQA) methods fail to reflect how degradations impact robotic performance. We propose Task-Driven Embodied Vision Quality Assessment (TD-EVQA), a novel framework that evaluates image quality based on its effect on downstream robotic task performance rather than human perception. To support this, we introduce SurgDegrade, a large-scale benchmark of degraded surgical images with quality labels derived from performance degradation in robotic tasks. By simulating real-world degradations at varying levels, we establish a fine-grained, task-aware quality supervision signal. To learn this quality-performance mapping, TD-EVQA adapts frozen vision-language foundation models via prompt tuning. The model incorporates static context prompts, instance-conditioned adaptation, and knowledge-guided regularization to infer task-relevant quality scores. Experiments show that TD-EVQA surpasses existing IQA methods in predicting performance degradation under visual distortion. Our approach redefines visual quality from a functional perspective, enhancing safety and reliability in robotic surgical systems.
Non-prehensile manipulation is challenging due to complex contact interactions between objects, the environment, and robots. Model-based approaches can efficiently generate complex trajectories of robots and objects under contact constraints. However, they tend to be sensitive to model inaccuracies and require access to privileged information (e.g., object mass, size, pose), making them less suitable for novel objects. In contrast, learning-based approaches are typically more robust to modeling errors but require large amounts of data. In this letter, we bridge these two approaches to propose a framework for learning closed-loop non-prehensile manipulation. By leveraging computationally efficient Contact-Implicit Trajectory Optimization (CITO), we design demonstration-guided deep Reinforcement Learning (RL), leading to sample-efficient learning. We also present a sim-to-real transfer approach using a privileged training strategy, enabling the robot to perform non-prehensile manipulation using only proprioception, vision, and force sensing without access to privileged information. Our method is evaluated on several tasks, demonstrating that it can successfully perform zero-shot sim-to-real transfer.
In industrial and medical environments, robotic manipulation frequently encounters dual challenges of spatially constrained workspaces and visual obstructions. Wireless actuation methodologies, leveraging non-contact energy transmission and adaptive control mechanisms, offer an innovative solution to address the limitations of physical interconnections and enhance operational adaptability in structurally confined and uncertain obstructed environments. Here, we implemented a non-contact continuum robotic arm based on microwave-driving with polarization-directed guidance. The system incorporates customized flexible printed circuit (FPC) antennas and spatial microwave energy field modulation to enable wireless actuation and multi-degree-of-freedom (DOF) motion control of the robotic end-effector. By integrating spring-supported mechanisms and shape memory alloy (SMA) spring deformation responses, it accomplishes tasks such as obstacle penetration, component grasping, and retrieval—all without physical connections. This study demonstrates precise coupling between directionally controlled microwave energy and mechanical motion in obscured settings, offering a novel approach for non-contact, multi-DOF, and multi-structure robotic operations in sealed or unstructured environments. The proposed methodology significantly expands the potential applications and operational capabilities of robots in complex real-world conditions.
Motivated by pigeons flying in flocks, this paper delves into outdoor flocking flight using pigeon-inspired flapping-wing robots (FWRs). Research on and implementation of cooperative flight for FWRs remain scarce, generally exhibiting constrained operational environments, small numbers of individuals, and poor flight performance. This study aims to construct a stable outdoor flocking system based on our self-developed small FWR (USTB-Dove, 0.8 m wingspan). The objective is to achieve superior 3D flocking flight with multiple FWRs outdoors. To achieve this, we first improve the USTB-Dove to enhance its payload capacity and controllability. The upgraded platform, featuring a redesigned wing and tail, is capable of cruising for 30 minutes with a 65-gram payload. Additionally, within a leader-follower architecture, we design a guidance law and a velocity control model for the followers. To achieve superior leader tracking, the model is developed through physics-based modeling and parameter optimization of the actual FWR and is solved using the Newton-Raphson method. Furthermore, we develop a flocking architecture with distributed controllers and centralized monitoring. A virtual-leader approach is adopted to minimize communication overhead, enabling stable 1-km operation with a system capacity of 31 FWRs. Ultimately, we conduct an outdoor dual-FWR test to validate the leader-tracking capability; a five-FWR test then demonstrates the system's stability and potential for scalability. The diameter of the five-FWR 3D flock is controlled at approximately 10 m with an altitude stability of $35 \pm 3$ m, which is, to our knowledge, the best performance achieved in published outdoor flocking flight research on FWRs.
Achieving high-fidelity modeling of robot dynamics remains a major challenge in scientific computing and artificial intelligence. Black-box models are prone to overfitting and lack interpretability, while continuous Lagrangian-based Physics-Informed Neural Networks (PINNs) require differentiated data from position measurements, increasing acquisition complexity and introducing noise that degrades performance. To address these issues, we propose a Structured Discrete Lagrangian Neural Network (StrDLNN), grounded in the discrete Lagrangian formulation and requiring only position data. Experiments demonstrate that StrDLNN recovers the underlying physical structure from position-only observations, achieving superior data efficiency, extrapolation capability, and robustness to sampling frequency compared with existing baselines. When applied to a UR10e robot, StrDLNN incorporates a friction model and accurately identifies both overall dynamics and friction parameters without friction labels, achieving a coefficient of determination $R^{2}$ of 0.94 in dynamics modeling.
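For context, the discrete Lagrangian formulation that lets the network learn from position-only data is the standard variational-integrator identity: with a discrete Lagrangian $L_d(q_k, q_{k+1}) \approx \int_{t_k}^{t_{k+1}} L(q, \dot{q})\,dt$, trajectories satisfy the discrete Euler-Lagrange equations

$$D_2 L_d(q_{k-1}, q_k) + D_1 L_d(q_k, q_{k+1}) = 0,$$

where $D_1$ and $D_2$ denote derivatives with respect to the first and second arguments. Only sampled positions $q_k$ appear, so no differentiated velocity or acceleration data are required; how StrDLNN parameterizes $L_d$ and the friction terms is not shown here.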
Soft everting robots present significant advantages over traditional rigid robots, including enhanced dexterity, improved environmental interaction, and safe navigation in unpredictable environments. While soft everting robots have been widely demonstrated for exploration-type tasks, their potential to move and deliver payloads in such tasks has been less investigated, with previous work focusing on sensors and tools for the robot. Leveraging the navigation capabilities and deployed body of the soft everting robot to deliver payloads in hazardous areas, e.g., carrying a water bottle to a person stuck under debris, would represent a significant capability in many applications. In this work, we present an analysis of how soft everting robots can be used to deliver larger, heavier payloads through the inside of the robot. We analyze both what objects can be delivered and what terrain features they can be transported through. Applying existing models, we present methods to quantify the effects of payloads on robot growth and self-support, and we modify a model to predict payload slippage. We then experimentally quantify payload transport using a soft everting robot with a variety of payload shapes, sizes, and weights and through a series of tasks: delivery, steering, vertical transport, movement through holes, and movement across gaps. We can transport payloads of various shapes and up to $1.5\,\mathrm{kg}$ in weight. We show that unsteered growth can deliver objects with high positional accuracy and precision in uncluttered environments and moderate accuracy in cluttered environments. We can move through circular apertures with as little as $0.01\,\mathrm{cm}$ of clearance around payloads. We can carry out discrete turns of up to $135^{\circ}$ and move across unsupported gaps of up to $1.15\,\mathrm{m}$ in length.
Deployable service and delivery robots struggle to navigate multi-floor buildings to reach object goals, as existing systems fail due to single-floor assumptions and requirements for offline, globally consistent maps. Multi-floor environments pose unique challenges including cross-floor transitions and vertical spatial reasoning, especially navigating unknown buildings. Object-Goal Navigation benchmarks like HM3D and MP3D also capture this multi-floor reality, yet current methods lack support for online, floor-aware navigation. To bridge this gap, we propose ASCENT, an online framework for Zero-Shot Object-Goal Navigation that enables robots to operate without pre-built maps or retraining on new object categories. It introduces: 1) a Multi-Floor Abstraction module that dynamically constructs hierarchical representations with stair-aware obstacle mapping and cross-floor topology modeling, and 2) a Coarse-to-Fine Reasoning module that combines frontier ranking with LLM-driven contextual analysis for multi-floor navigation decisions. We evaluate on HM3D and MP3D benchmarks, outperforming state-of-the-art zero-shot approaches, and demonstrate real-world deployment on a quadruped robot.
Manipulating clusters of deformable objects presents a substantial challenge with widespread applicability, but requires contact-rich whole-arm interactions. A potential solution must address the limited capacity for realistic model synthesis, high uncertainty in perception, and the lack of efficient spatial abstractions, among others. We propose a novel framework for learning model-free policies integrating two modalities: 3D point clouds and proprioceptive touch indicators, emphasising manipulation with full body contact awareness, going beyond traditional end-effector modes. Our reinforcement learning framework leverages a distributional state representation, aided by kernel mean embeddings, to achieve improved training efficiency and real-time inference. Furthermore, we propose a novel context-agnostic occlusion heuristic to clear deformables from a target region for exposure tasks. We deploy the framework in a power line clearance scenario and observe that the agent generates creative strategies leveraging multiple arm links for de-occlusion. Finally, we perform zero-shot sim-to-real policy transfer, allowing the arm to clear real branches with unknown occlusion patterns, unseen topology, and uncertain dynamics.
Mechanical compliance is a key design parameter for dynamic contact-rich manipulation, affecting task success and safety robustness over contact geometry variation. Design of soft robotic structures, such as compliant fingers, requires choosing design parameters which affect geometry and stiffness, and therefore manipulation performance and robustness. Today, these parameters are chosen either through hardware iteration, which takes significant development time, or through simplified models (e.g. planar), which cannot address complex manipulation task objectives. Improvements in dynamic simulation, especially in contact and friction modeling, present a potential design tool for mechanical compliance. We propose and investigate the feasibility of a simulation-based design tool for compliant mechanisms which allows design with respect to task-level objectives, such as success rate. This is applied to optimize the design parameters of a structured compliant finger to reduce failure cases inside a tolerance window in insertion tasks. The improvement in robustness is then validated on a real robot using tasks from the benchmark NIST task board. The finger stiffness affects the tolerance window: optimized parameters can increase tolerable ranges by a factor of 2.29, with workpiece variation up to $8.6\,\mathrm{mm}$ being compensated. However, the trends remain task-specific: in some tasks, the highest stiffness yields the widest tolerable range, whereas in others the opposite is observed, motivating the need for design tools which can consider application-specific geometry and dynamics. While the task simulation demonstrates high accuracy for simple scenarios, achieving 100% failure-mode prediction, tasks with higher geometric complexity yield significant discrepancies between simulation and reality, resulting in a prediction rate as low as 27.3% due to excessive simplifications of the simulated task. Failure during insertion can be modeled with a small sim2real gap, whereas failure during search and in-hand slip demonstrates a higher sim2real gap.
6-DoF posture estimation is a critical technique in robotics. However, a significant gap exists between the two primary approaches—camera-based methods and laser tracker systems—in terms of cost and performance. To bridge this gap, this work proposes a method to calculate the 6-DoF posture of a manipulator’s end-effector using the 2D profile of a custom-designed metallic gauge. The core principle relies on a one-to-one correspondence between the measured profile and the manipulator’s posture. A mathematical model was developed to derive a closed-form solution, which is further refined via maximum likelihood estimation to enhance robustness and accuracy. Simulation studies assessed the influence of geometric parameters on estimation accuracy and noise robustness. Real-world experiments demonstrated that the refined solution significantly outperforms the closed-form solution alone. Furthermore, a speed benchmark against a camera-based system highlighted the proposed method’s advantage in operating frequency. Finally, the method was successfully integrated into a real-time position control task for a 6-axis industrial manipulator, verifying its practical applicability in real-time robotic control.
We study a problem of multi-agent exploration with behaviorally heterogeneous robots. Each robot maps its surroundings using SLAM and identifies a set of areas of interest (AoIs) or frontiers that are the most informative to explore next. The robots assess the utility of going to a frontier using Behavioral Entropy (BE) and then determine which frontier to visit via a distributed task assignment scheme. We convert the task assignment problem into a non-cooperative game and use a distributed algorithm (d-PBRAG) to converge to the Nash equilibrium, which is shown to be the optimal task allocation solution. For unknown utility cases, we provide robust bounds using approximate rewards. We test our algorithm, which has low communication cost and fast convergence, in simulation, where we explore the effect of sensing radii, sensing accuracy, and heterogeneity among robotic teams on exploration completion time and path length. We observe that having a team of agents with heterogeneous behaviors is beneficial.
Aerial transportation robots using suspended cables have emerged as versatile platforms for disaster response and rescue operations. To maximize the capabilities of these systems, robots need to aggressively fly through tightly constrained environments, such as dense forests and structurally unsafe buildings, while minimizing flight time and avoiding obstacles. Existing methods geometrically over-approximate the vehicle and obstacles, leading to conservative maneuvers and increased flight times. We eliminate these restrictions by proposing PolyFly, an optimal global planner which considers a non-conservative representation for aerial transportation by modeling each physical component of the environment, and the robot (quadrotor, cable and payload), as independent polytopes. We further increase the model accuracy by incorporating the attitude of the physical components by constructing orientation-aware polytopes. The resulting optimal control problem is efficiently solved by converting the polytope constraints into smooth differentiable constraints via duality theory. We compare our method against the existing state-of-the-art approach in eight maze-like environments and show that PolyFly produces faster trajectories in each scenario. We also experimentally validate our proposed approach on a real quadrotor with a suspended payload, demonstrating the practical reliability and accuracy of our method.
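The duality-based treatment of polytope constraints is standard in optimization-based planning; for a robot-component polytope $\{x : Ax \le b\}$ and an obstacle polytope $\{x : Cx \le d\}$, separation by at least $d_{\min}$ can be written as the smooth constraint (PolyFly's exact formulation may differ)

$$\exists\,\lambda \ge 0,\ \mu \ge 0:\quad -b^{\top}\lambda - d^{\top}\mu \ge d_{\min},\qquad A^{\top}\lambda + C^{\top}\mu = 0,\qquad \|A^{\top}\lambda\|_{2} \le 1,$$

which is differentiable in the polytope parameters and the dual multipliers, so it can be embedded directly in the optimal control problem alongside the orientation-aware polytope construction.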
Autonomous robots performing navigation tasks in complex environments often face significant challenges due to uncertainty in state estimation. In settings where the robot faces significant resource constraints and accessing high-precision localization comes at a cost, the planner may have to rely primarily on less precise state estimates. Our key observation is that different tasks, and different portions of tasks require varying levels of precision in different regions: a robot navigating a crowded space might need precise localization near obstacles but can operate effectively with less precise state estimates in open areas. In this paper, we present a planning method for integrating task-specific uncertainty requirements directly into navigation policies. We introduce Task-Specific Uncertainty Maps (TSUMs), which abstract the acceptable levels of state estimation uncertainty across different regions. TSUMs align task requirements and environmental features using a shared representation space, generated via a domain-adapted encoder. Using TSUMs, we propose $\mathbf{G}$eneralized $\mathbf{U}$ncertainty $\mathbf{I}$ntegration for $\mathbf{D}$ecision-Making and $\mathbf{E}$xecution (GUIDE), a policy-conditioning framework that incorporates these uncertainty requirements into the robot’s decision-making process, enabling the robot to reason about the context-dependent value of certainty and adapt its behavior accordingly. We show how integrating GUIDE into reinforcement learning frameworks allows the agent to learn navigation policies that effectively balance task completion and uncertainty management without the need for explicit reward engineering. We evaluate GUIDE on a variety of real-world robotic navigation tasks and find that it demonstrates significant improvement in task completion rates compared to baseline methods that do not explicitly consider task-specific uncertainty.
Pedestrian detection is a critical task in robot perception. Multispectral modalities (visible light and thermal) can boost pedestrian detection performance by providing complementary visual information. Several gaps remain in existing multispectral pedestrian detection methods. First, existing approaches primarily focus on spatial fusion and often neglect temporal information. Second, RGB and thermal image pairs in multispectral benchmarks may not always be perfectly aligned. Pedestrians are also challenging to detect due to varying lighting conditions, occlusion, and similar factors. This work proposes Strip-Fusion, a spatial-temporal fusion network that is robust to misalignment in input images, as well as to varying lighting conditions and heavy occlusions. The Strip-Fusion pipeline integrates temporally adaptive convolutions to dynamically weigh spatial-temporal features, enabling our model to better capture pedestrian motion and context over time. A novel Kullback–Leibler divergence loss was designed to mitigate modality imbalance between visible and thermal inputs, guiding feature alignment toward the more informative modality during training. Furthermore, a novel post-processing algorithm was developed to reduce false positives. Extensive experimental results show that our method performs competitively on both the KAIST and CVC-14 benchmarks. We also observe significant improvements over the previous state of the art under challenging conditions such as heavy occlusion and misalignment.
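As a rough sketch of the modality-balancing idea, the snippet below aligns the weaker modality's class distribution with the stronger one via a KL term. The per-sample choice of target modality, the direction of the divergence, and the use of logits are assumptions for illustration and do not reproduce Strip-Fusion's actual loss.

```python
import torch
import torch.nn.functional as F

def modality_kl_loss(logits_rgb, logits_thermal, thermal_is_stronger):
    """Push the weaker modality's distribution toward the stronger one.

    logits_rgb, logits_thermal: (N, C) per-modality predictions.
    thermal_is_stronger: (N,) bool tensor marking which modality is more informative.
    """
    p_rgb = F.log_softmax(logits_rgb, dim=-1)
    p_th = F.log_softmax(logits_thermal, dim=-1)
    # KL(target || student); detach the target so gradients only move the weaker branch.
    kl_rgb_to_th = F.kl_div(p_rgb, p_th.detach(), log_target=True, reduction="none").sum(-1)
    kl_th_to_rgb = F.kl_div(p_th, p_rgb.detach(), log_target=True, reduction="none").sum(-1)
    return torch.where(thermal_is_stronger, kl_rgb_to_th, kl_th_to_rgb).mean()
```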
Safety is a critical concern in motion planning for autonomous vehicles. Modern autonomous vehicles rely on neural network-based perception, but making control decisions based on these inference results poses significant safety risks due to inherent uncertainties. To address this challenge, we present a distributionally robust optimization (DRO) framework that accounts for both aleatoric and epistemic perception uncertainties using evidential deep learning (EDL). Our approach introduces a novel ambiguity set formulation based on evidential distributions that dynamically adjusts the conservativeness according to perception confidence levels. We integrate this uncertainty-aware constraint into model predictive control (MPC), proposing the DRO-EDL-MPC algorithm with computational tractability for autonomous driving applications. Validation in the CARLA simulator demonstrates that our approach maintains efficiency under high perception confidence while enforcing conservative constraints under low confidence.
Robotic centering grasp and continuous in-hand manipulation of slender objects are challenging. This article presents a manipulator with a stable gripping mechanism in which the object can be registered automatically. Based on the bionic freedom layout and transmission mechanism, opening/closing, rotating, and pushing are realized. The decoupling mechanism enables rear-mounting of the motors, avoiding cable winding, space occupied by the motor, and the motor itself becoming load inertia. Continuity of operation is realized through an active surface and dual-path transmission. The kinematics was verified in a prototype experiment. The prototype can grip objects 0.5–3.5 mm in diameter with a maximum gripping force of 108 N, a position error better than 1 mm, and a direction error of about 2°. The prototype can operate a medical guide wire with a diameter of 0.035 inches for continuous displacement, rotation, and spiral motion. The positioning and motion performance were evaluated experimentally and, combined with the demonstrations, indicate application potential. Compared with existing devices, it achieves continuous in-hand manipulation of slender objects for the first time.
To address the challenges of insufficient wind resistance stability and low detection accuracy for strand breakage defects during overhead power line inspection, this study proposes a dual-mode inspection robot solution integrating a Dynamic Conforming Pressing (DCP) mechanism and an infrared vision system. Firstly, a dual-mode inspection robot with flight-to-walking mode transition capability was designed and developed. The robot incorporates a multi-spring buffered DCP structure to enhance operational stability under wind loads. Subsequently, an efficient infrared vision detection system was established based on the YOLO11 framework, integrating the ShuffleNetV2 lightweight backbone network with BiFPN multi-scale feature fusion mechanisms, and validated on a strand breakage defect dataset. Experimental results indicate that the DCP-equipped robot reduces stabilization time by over 40% under wind loads of 4 m/s, 8 m/s, and 12 m/s. Meanwhile, the enhanced YOLO11 model achieves a mean Average Precision (mAP) of 0.991 with an inference speed of 85.26 FPS, surpassing the performance of Faster R-CNN, YOLOv5, and YOLOv8. This research demonstrates the effective integration of mechanical stability and intelligent perception capabilities, providing critical technical support for advancing the intelligence and reliability of power line inspection systems.
Multilink transformable rotorcraft demonstrate exceptional flexibility when navigating confined spaces, yet face critical challenges including a time-varying center of gravity, body misalignment, and the absence of a unified control strategy during dynamic reconfiguration, which severely restrict motion continuity and operational capability. To address these limitations, we propose an H-shaped multi-modal transformable rotorcraft. Its novelty lies in a designed H-shaped variable structure with 6 controllable degrees of freedom (CDOF), achieving a balance between structural flexibility and the resolution of time-varying center-of-gravity limitations. We further propose a control parameter tuning method based on motion characteristic values and a power distribution scheme based on the principle of competence. Under limited computational overhead, the platform realizes narrow-gap flight and single-propeller power failure response through dynamic configuration reconstruction. Experimental results demonstrate that our platform successfully overcomes stability challenges: it achieves a position deviation of less than 4 cm when traversing constrained spaces, reduces its lateral width by more than 50%, and switches to a fault-tolerant configuration within 1.2 s in response to sudden single-propeller power failures while maintaining stable flight. Additionally, it possesses torque output capability for twist operations during manipulation tasks. The H-shaped transformable rotorcraft enables multi-modal morphing control and smooth power transitions, providing a versatile platform for multi-scenario flight operations.
Passive electric sense, a unique underwater perception ability in aquatic species, enables fish to detect external electrical signals and perform essential life activities in dark or turbid environments. Inspired by the passive electrosensing capability of natural sharks, this letter proposes a low-cost hardware solution for a robotic shark's passive electric sense system to detect weak underwater electric fields. However, motion-induced electrode displacement generates dynamic electric field noise that interferes with measurements. We analyze the properties of this noise, which is correlated with the body waves of the robotic shark, and then propose a suppression method based on this analysis. Next, simulation tests are conducted to demonstrate the time- and frequency-domain properties of the dynamic electric field noise. Finally, through physical experiments, the robotic shark equipped with the passive electric sense system successfully detected $\mu$V-level underwater electric fields, and the proposed suppression method improved the signal-to-noise ratio from 6.697 dB to 9.445 dB compared to the unfiltered case.
Autonomous aerial robots require accurate and efficient local environment representations to enable safe and agile navigation in cluttered and unknown environments. In this letter, we propose PolyMap, an online local polyhedral mapping-planning framework designed for aerial robots. PolyMap represents obstacles as a set of convex polyhedra constructed directly from raw point clouds, providing a compact and planner-friendly geometric abstraction. A novel concavity-aware decomposition algorithm is introduced to partition non-convex point cloud clusters into tightly fitting convex subcomponents, significantly reducing conservativeness while maintaining computational efficiency. Furthermore, we employ the standard dual representation of convex polyhedra to achieve fast collision checking and enable seamless integration with optimization-based motion planners. High-fidelity simulations and real-world experiments are conducted to demonstrate the effectiveness and practicality of the proposed method.
Soft robotic actuators present tradeoffs between mechanical compliance and practical capabilities, motivating new actuation strategies that harness the storage and rapid release of elastic energy for high-performance movements. We present a servo-driven, soft linear actuator that integrates compact mechanical clutches with architected elastomers based on handed shearing auxetics (HSAs). The actuator stores elastic strain energy during servo-driven HSA extension that can be released upon clutch disengagement to drive rapid contraction. Altogether, the actuator achieves a maximum pulling force of 184 N, an actuation strain of 23%, and a contraction speed of 1.55 m/s. A comparative characterization of performance with and without clutching reveals a nearly ten-fold increase in peak power with clutch-enabled actuation, while maintaining intrinsic compliance and back-drivability. When integrated into an artificial musculoskeletal system such as a human-scale robot leg, the actuators enable dynamic tasks through clutch disengagement, such as ball kicking and ground push-off. Rapid clutch re-engagement can also stabilize motion by suppressing rebound oscillations. We evaluate the leg's performance in these tasks, including transferred mechanical work, energy efficiency, and force output. The results demonstrate the actuator's ability to generate fast, powerful motions and its potential to advance artificial musculoskeletal systems for bioinspired robots.
In this work, we propose a novel artificial vector field for robot navigation in $n$-dimensional path-following tasks, designed to ensure safety and convergence with a smoothed control law. Unlike previous methods based on discontinuous Euclidean distance functions, our approach uses a smooth Euclidean-like function to achieve a continuous control law formulation and a field combination to balance the objectives of avoiding obstacles and following the path. This results in a navigation method that follows a target path while preventing robots from approaching obstacles, which can be used in different applications. We provide formal proofs of safety using barrier function concepts and of path convergence via Lyapunov theory. The methodology is validated through extensive numerical simulations and real-world experiments. These include extensions of the methodology to more complex cases, such as quadcopters and multi-robot systems, underlining the method's advantages in achieving safe and reliable robot navigation.
Monocular omnidirectional visual odometry (OVO) systems leverage 360-degree cameras to overcome field-of-view limitations of perspective VO systems. However, existing methods, reliant on handcrafted features or photometric objectives, often lack robustness in challenging scenarios, such as aggressive motion and varying illumination. To address this, we present 360DVO, the first deep learning-based OVO framework. Our approach introduces a distortion-aware spherical feature extractor (DAS-Feat) that adaptively learns distortion-resistant features from 360-degree images. These sparse feature patches are then used to establish constraints for effective pose estimation within a novel omnidirectional differentiable bundle adjustment (ODBA) module. To facilitate evaluation in realistic settings, we also contribute a new real-world OVO benchmark. Extensive experiments on this benchmark and public synthetic datasets (TartanAir V2 and 360VO) demonstrate that 360DVO surpasses state-of-the-art baselines (including 360VO and OpenVSLAM), improving robustness by 50% and accuracy by 37.5%.
Autonomous takeoff is crucial for ornithopters, and ground-equipment-assisted takeoff remains necessary for those lacking self-takeoff capability. This work presents a compact ornithopter launcher with a gliding takeoff control method to address the bulkiness and terrain limitations of existing takeoff-assist devices. We optimized the launching mechanism and clarified the operation stages. A dynamic model based on the blade-element method is developed to describe the ejecting, gliding, and wing-flapping motions, enabling takeoff simulations. The control scheme incorporates smooth flight stage transitions and closed-loop flight control. Experiments show that the 250-g ornithopter reached an initial speed of 4.0 m/s with a 15% improvement in energy efficiency, demonstrating a 12/12 success rate in indoor tests. We also validated heavy-load takeoff at 309 g and conducted outdoor tests under Beaufort force 2 winds and on rough ground. Conclusively, this work presents a feasible solution to the takeoff challenge of ornithopters.
Learning safe and stable robot motions from demonstrations remains a challenge, especially in complex, nonlinear tasks involving dynamic, obstacle-rich environments. In this letter, we propose Safe and Stable Neural Network Dynamical Systems S$^{2}$-NNDS, a learning-from-demonstration framework that simultaneously learns expressive neural dynamical systems alongside neural Lyapunov stability and barrier safety certificates. Unlike traditional approaches with restrictive polynomial parameterizations, S$^{2}$-NNDS leverages neural networks to capture complex robot motions, providing probabilistic guarantees through split conformal prediction in learned certificates. Experimental results in various 2D and 3D datasets—including LASA handwriting and demonstrations recorded kinesthetically from the Franka Emika Panda robot—validate the effectiveness of S$^{2}$-NNDS in learning robust, safe, and stable motions from potentially unsafe demonstrations.
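A compact sketch of how joint certificate learning can be posed as penalty terms on sampled states, assuming a learned vector field f_theta, a neural Lyapunov candidate V, and a neural barrier candidate h. The margins, the linear class-K term, and the equal weighting are assumptions, and the split conformal prediction step that gives S$^{2}$-NNDS its probabilistic guarantees is not shown.

```python
import torch

def certificate_losses(f_theta, V, h, x, x_eq, margin=1e-3, alpha=1.0):
    """Penalty terms encouraging V to certify stability and h to certify safety
    for the learned dynamics x_dot = f_theta(x), evaluated on sampled states x."""
    x = x.detach().requires_grad_(True)
    v = V(x).squeeze(-1)
    grad_v = torch.autograd.grad(v.sum(), x, create_graph=True)[0]
    v_dot = (grad_v * f_theta(x)).sum(-1)               # dV/dt along the learned flow
    lyap = (torch.relu(v_dot + margin).mean()           # V must decrease along trajectories
            + torch.relu(-v).mean()                     # V must be nonnegative
            + V(x_eq).pow(2).mean())                    # V vanishes at the equilibrium

    b = h(x).squeeze(-1)
    grad_h = torch.autograd.grad(b.sum(), x, create_graph=True)[0]
    h_dot = (grad_h * f_theta(x)).sum(-1)
    barrier = torch.relu(margin - (h_dot + alpha * b)).mean()   # enforce h_dot + alpha*h >= 0
    return lyap + barrier
```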
Visual localization is crucial for autonomous aerial vehicles (AAVs), especially during aggressive rotational maneuvers. Existing methods—whether handcrafted or learning-based—often suffer from limited rotation robustness or high computational costs, hindering their use in real-time applications. To address these challenges, we propose a novel framework for learning compact, rotation-aware binary descriptors. Our approach unifies descriptor binarization, knowledge distillation, and rotation-equivariant representation within a joint optimization formulation. We recast the objective as a Semi-Orthogonal Procrustes Problem, thereby converting binary descriptor learning into a classification task via pseudo-label generation. A dual-softmax loss is further employed to improve distinctiveness and geometric consistency. Extensive evaluations demonstrate that the method achieves highly competitive matching accuracy under large rotations while maintaining low computational cost and storage overhead. It thereby offers an efficient and practical solution for real-time visual localization on resource-constrained systems such as AAVs.
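The dual-softmax loss mentioned above scores a descriptor similarity matrix with softmax along both rows and columns and maximizes the resulting probability on ground-truth matches. The sketch below is a generic NumPy version under the assumption that descriptors at the same index in the two sets correspond; it is not the paper's training objective.

```python
import numpy as np

def dual_softmax_loss(desc_a, desc_b, temperature=0.1):
    """Generic dual-softmax matching loss for two sets of L2-normalized
    descriptors whose i-th rows are assumed to correspond."""
    sim = desc_a @ desc_b.T / temperature                # pairwise similarity matrix
    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)
    p = softmax(sim, axis=1) * softmax(sim, axis=0)      # dual-softmax probabilities
    matches = np.diag(p)                                 # ground-truth pairs on the diagonal
    return -np.log(matches + 1e-12).mean()

rng = np.random.default_rng(0)
d = rng.normal(size=(64, 128))
d /= np.linalg.norm(d, axis=1, keepdims=True)
noisy = d + 0.05 * rng.normal(size=d.shape)
noisy /= np.linalg.norm(noisy, axis=1, keepdims=True)
print(dual_softmax_loss(d, noisy))   # low loss for well-matched descriptor sets
```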
Global localization is critical for autonomous navigation, particularly in scenarios where an agent must localize within a map generated in a different session or by another agent, as agents often have no prior knowledge about the correlation between reference frames. However, this task remains challenging in unstructured environments due to appearance changes induced by viewpoint variation, seasonal changes, occlusions, and perceptual aliasing in homogeneous environments — known failure modes for traditional place recognition methods. To address these challenges, we propose VISTA (View-Invariant Segmentation-Based Tracking for Frame Alignment), a novel open-set, monocular global localization framework that combines: 1) a front-end, object-based, segmentation and tracking pipeline, followed by 2) a submap correspondence search, which exploits geometric consistencies between environment maps to align vehicle reference frames. VISTA enables consistent localization across diverse camera viewpoints and seasonal changes, without requiring any domain-specific training or finetuning. We evaluate VISTA on seasonal and oblique-angle aerial datasets, achieving up to a 69% improvement in recall over baseline methods. Furthermore, we maintain a compact object-based map that is only 0.6% the size of the most memory-conservative baseline, making our approach capable of real-time implementation on resource-constrained platforms.
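Once object correspondences between two submaps are hypothesized, the reference-frame alignment itself can be computed in closed form from matched object centroids. The snippet below is the standard Kabsch/SVD least-squares rigid alignment, shown purely as the geometric building block such a correspondence search would feed into; it is not VISTA's pipeline, and the example correspondences are synthetic.

```python
import numpy as np

def align_frames(points_a, points_b):
    """Least-squares rigid transform (R, t) with points_b ~ R @ points_a + t,
    given matched object centroids from two maps (standard Kabsch/SVD step)."""
    mu_a, mu_b = points_a.mean(axis=0), points_b.mean(axis=0)
    H = (points_a - mu_a).T @ (points_b - mu_b)          # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))               # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_b - R @ mu_a
    return R, t

# Example: recover a known yaw and translation offset between two object maps.
rng = np.random.default_rng(0)
a = rng.uniform(-20, 20, size=(12, 3))
yaw = np.deg2rad(30)
R_true = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                   [np.sin(yaw),  np.cos(yaw), 0.0],
                   [0.0,          0.0,         1.0]])
b = a @ R_true.T + np.array([5.0, -2.0, 0.0])
R, t = align_frames(a, b)
print(np.allclose(R, R_true, atol=1e-6), t)
```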
Robotic manipulators hold significant untapped potential for manufacturing industries, particularly when deployed in multi-robot configurations that can enhance resource utilization, increase throughput, and reduce costs. However, industrial manipulators typically operate in isolated one-robot, one-machine setups, limiting both utilization and scalability. Even mobile robot implementations generally rely on centralized architectures, creating vulnerability to single points of failure and requiring robust communication infrastructure. This paper introduces SMAPPO (Scalable Multi-Agent Proximal Policy Optimization), a scalable, input-size-invariant multi-agent reinforcement learning model for decentralized multi-robot management in industrial environments, building on MAPPO (Multi-Agent Proximal Policy Optimization), the current state-of-the-art approach. We optimized an existing simulator to handle complex multi-agent reinforcement learning scenarios and designed a new multi-machine tending scenario for evaluation. Our novel observation encoder enables SMAPPO to handle varying numbers of agents, machines, and storage areas with minimal or no retraining. Results demonstrate SMAPPO's superior performance compared to the state-of-the-art MAPPO across multiple conditions: full retraining (up to 61% improvement), curriculum learning (up to 45% increased productivity and up to 49% fewer collisions), zero-shot generalization to significantly different-scale scenarios (up to 272% better performance without retraining), and adaptability under extremely low initial training (up to 100% increase in parts delivery).
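One common way to make a policy input invariant to the number of agents, machines, and storage areas is to embed each entity separately and pool the embeddings into a fixed-size vector. The sketch below illustrates that generic idea with random weights and made-up feature sizes; it is not SMAPPO's actual observation encoder.

```python
import numpy as np

class PooledEntityEncoder:
    """Toy size-invariant encoder: per-entity linear embedding + mean pooling,
    so the output dimension is fixed no matter how many entities are observed."""
    def __init__(self, feat_dim, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(feat_dim, embed_dim))
        self.b = np.zeros(embed_dim)

    def __call__(self, entities):
        # entities: (num_entities, feat_dim); num_entities may vary per step.
        h = np.maximum(entities @ self.W + self.b, 0.0)   # ReLU embeddings
        return h.mean(axis=0)                             # fixed-size summary

enc = PooledEntityEncoder(feat_dim=6, embed_dim=32)
# 3 machines or 11 machines: the encoded observation keeps the same shape.
print(enc(np.random.rand(3, 6)).shape, enc(np.random.rand(11, 6)).shape)
```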
Sperm DNA integrity indicates genetic quality and stability, which is vital for successful fertilization, embryo development, and the overall health of offspring. Currently, DNA integrity is assessed manually by clinical staff through invasive staining procedures, which are labor-intensive and alter sperm viability. In this work, we propose a deep learning-based approach to automatically predict DNA integrity in live sperm without manual operation or invasive staining. Specifically, the proposed method performs pixel-wise prediction of ssDNA and dsDNA for accurate DNA Fragmentation Index (DFI) calculation. A coupling network is introduced to improve both prediction accuracy and interpretability. To effectively capture both local and global features, the enhancement model integrates convolutional operations into the Transformer architecture. Additionally, a cross-scale fusion Transformer block is designed to fuse multi-scale information using a cross-attention mechanism. To further enhance prediction accuracy, a high-frequency branch is incorporated to guide the network's focus toward fine-grained, high-frequency details. Experimental results on a sperm DFI dataset demonstrate that the proposed method outperforms existing enhancement networks.
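The cross-scale fusion described above hinges on a cross-attention step in which queries come from tokens at one scale and keys/values from another. The sketch below is a bare single-head version in NumPy with arbitrary token counts and dimensions, not the paper's Transformer block.

```python
import numpy as np

def cross_attention(queries, context, d_k=32, seed=0):
    """Single-head cross-attention: tokens from one scale (queries) attend to
    tokens from another scale (context). Projection weights are random stand-ins."""
    rng = np.random.default_rng(seed)
    Wq = rng.normal(scale=0.1, size=(queries.shape[-1], d_k))
    Wk = rng.normal(scale=0.1, size=(context.shape[-1], d_k))
    Wv = rng.normal(scale=0.1, size=(context.shape[-1], d_k))
    Q, K, V = queries @ Wq, context @ Wk, context @ Wv
    logits = Q @ K.T / np.sqrt(d_k)                       # scaled dot-product scores
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)              # softmax over context tokens
    return attn @ V                                       # fused features per query token

fine = np.random.rand(196, 64)    # fine-scale tokens (hypothetical sizes)
coarse = np.random.rand(49, 96)   # coarse-scale tokens
print(cross_attention(fine, coarse).shape)   # (196, 32)
```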
Voxel-based LiDAR–inertial odometry (LIO) is accurate and efficient but can suffer from geometric inconsistencies when single-Gaussian voxel models indiscriminately merge observations from conflicting viewpoints. To address this limitation, we propose Azimuth-LIO, a robust voxel-based LIO framework that leverages azimuth-aware voxelization and probabilistic fusion. Instead of using a single distribution per voxel, we discretize each voxel into azimuth-sectorized substructures, each modeled by an anisotropic 3D Gaussian to preserve viewpoint-specific spatial features and uncertainties. We further introduce a direction-weighted distribution-to-distribution registration metric to adaptively quantify the contributions of different azimuth sectors, followed by a Bayesian fusion framework that exploits these confidence weights to ensure azimuth-consistent map updates. The performance and efficiency of the proposed method are evaluated on public benchmarks including the M2DGR, MCD, and SubT-MRS datasets, demonstrating superior accuracy and robustness compared to existing voxel-based algorithms.
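The azimuth-sectorized voxel structure can be pictured as a per-voxel partition over viewing direction, with one Gaussian per sector. The sketch below groups a voxel's points by the azimuth of the sensor-to-point ray and fits a mean and covariance per non-empty sector; the sector count and covariance regularization are illustrative choices, not the paper's settings.

```python
import numpy as np

def sectorize_voxel(points, sensor_origin, n_sectors=8):
    """Group a voxel's points by the azimuth of the ray from the sensor and
    fit one Gaussian (mean, covariance) per non-empty sector."""
    rays = points - sensor_origin
    azimuth = np.arctan2(rays[:, 1], rays[:, 0])               # angle in [-pi, pi)
    sector = ((azimuth + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
    gaussians = {}
    for s in np.unique(sector):
        pts = points[sector == s]
        if len(pts) < 3:
            continue                                           # too few points to fit
        mean = pts.mean(axis=0)
        cov = np.cov(pts.T) + 1e-6 * np.eye(3)                 # regularized covariance
        gaussians[int(s)] = (mean, cov)
    return gaussians

pts = np.random.rand(200, 3)                                   # points inside one voxel
print(sorted(sectorize_voxel(pts, sensor_origin=np.array([5.0, 0.0, 0.5])).keys()))
```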
This paper proposes DiffPF, a differentiable particle filter that leverages diffusion models for state estimation in dynamic systems. Unlike conventional differentiable particle filters, which require importance weighting and typically rely on predefined or low-capacity proposal distributions, DiffPF learns a flexible posterior sampler by conditioning a diffusion model on predicted particles and the current observation. This enables accurate, equally-weighted sampling from complex, high-dimensional, and multimodal filtering distributions. We evaluate DiffPF across a range of scenarios, including both unimodal and highly multimodal distributions, and test it on simulated as well as real-world tasks, where it consistently outperforms existing filtering baselines. In particular, DiffPF achieves a 90.3% improvement in estimation accuracy on a highly multimodal global localization benchmark, and a nearly 50% improvement on the real-world robotic manipulation benchmark, compared to state-of-the-art differentiable filters. To the best of our knowledge, DiffPF is the first method to integrate conditional diffusion models into particle filtering, enabling high-quality posterior sampling that produces more informative particles and significantly improves state estimation. The code is available at https://github.com/ZiyuNUS/DiffPF.
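To make the sampling step concrete: conditioning a diffusion model on the predicted particles and the current observation amounts to running a standard reverse (denoising) process with those quantities fed to the noise predictor, yielding equally weighted posterior samples. The sketch below wires a DDPM-style reverse loop around a stub denoiser; the noise schedule, the stub, and the conditioning interface are placeholders rather than DiffPF's trained model.

```python
import numpy as np

def sample_posterior_particles(denoiser, predicted, observation, n_steps=50, seed=0):
    """DDPM-style reverse process producing equally weighted posterior particles.
    `denoiser(x_t, t, predicted, observation)` should return the predicted noise."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.normal(size=predicted.shape)                  # start from pure noise
    for t in reversed(range(n_steps)):
        eps = denoiser(x, t, predicted, observation)      # conditional noise estimate
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])         # mean of x_{t-1} given x_t
        if t > 0:
            x += np.sqrt(betas[t]) * rng.normal(size=x.shape)   # simplified sigma_t
    return x                                              # one state sample per particle

# Stub denoiser that ignores its conditioning; a trained network would use it.
stub = lambda x, t, pred, obs: np.zeros_like(x)
particles = sample_posterior_particles(stub, predicted=np.zeros((128, 3)), observation=None)
print(particles.shape)     # (128, 3): equally weighted posterior particles
```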