Hear and Harvest: Robots Use SonicBoom to Pick Produce

CMU’s SonicBoom uses contact microphones to help robots detect and harvest fruit in unstructured farm environments where vision-based systems struggle.


News August 25, 2025 by Stephanie Leonida

Researchers from Carnegie Mellon University’s Robotics Institute are developing SonicBoom, a sound-based robotic sensing technology that uses contact microphones housed within a boom mic-like end effector to localize objects in unstructured environments, such as a leafy apple tree or raspberry bush. The approach could extend robotic perception beyond visual input to include sound, offering a potential solution for challenging farming tasks.

 

SonicBoom employs touch-based sensing and vibration detection to locate rigid objects. Image used courtesy of CMU

 

The Labor Shortage in Agriculture

The agricultural industry is facing a critical shortage of skilled labor. Policy changes in migration, training and education, rural/regional development, and labor markets have compounded this issue. In the U.S., the tightening of immigration policies, rising operational costs, an ageing farming workforce, and the challenges of onboarding advanced automated farming machinery have contributed to the problem. The significant cuts to the migrant workforce have hindered harvest operations, including fruit/vegetable picking. Farmers have suffered from downturns in productivity and yield, with fields going partially or completely unharvested, food waste increasing from unpicked produce, and the quality of specialty foods requiring higher labor (such as strawberries and avocados) declining.

Farmers and growers are turning to more advanced machinery and systems to help combat the labor shortage, employing vertical farming methods, robotic transport systems, soft-fruit-picking robots, and other automated harvesters to augment human labor. One such soft-fruit-picking technology is the Fieldworker 1 autonomous harvesting robot from Fieldwork Robotics. This bot employs tailored AI algorithms and an advanced machine vision system to pick berries at the right ripeness. A single operator can oversee multiple bots, which reduces labor costs and labor intensity and removes human bias in judging ripeness, helping to optimize fruit quality.

 

Using Sound to Detect and Pick Food

Modern agricultural robots designed for harvesting soft to hard fruit and vegetables often rely on vision-based systems with fragile sensors and bulky camera setups. One challenge for agri-minded roboticists is building a fruit- and vegetable-picking system that combines durable sensors with the dexterity needed for complex picking in unstructured environments. In an orchard or a field of raspberry bushes, leaves and branches can obscure the camera's view and make it difficult for a robot to sense and pick the target produce.

A team of researchers from Carnegie Mellon University’s Robotics Institute has created a technology that uses contact microphones, instead of conventional microphones (which pick up sound vibrations through air), to sense the vibrations produced when the robot touches objects.

 

SonicBoom uses sound to detect and pick fruit, relying on contact microphones rather than conventional microphones to detect objects. Video used courtesy of CMU’s Robotics Institute

 

The SonicBoom hardware is a PVC tube that resembles a boom mic. Two rings, each fitted with three contact microphones, are adhered to the walls of the tube, one at each end, so that contact anywhere along the tube and the resulting vibrations can be picked up.

How did the researchers account for the distortions that the robot's various plastic and metal parts introduce into the vibrations picked up by the contact sensors? They combined feature engineering with a learning-based approach. SonicBoom incorporates a learning model trained on three key inputs: GCC-PHAT (generalized cross-correlation with phase transform), Mel spectrograms, and robot proprioception (the robot’s sense of its own movement).

Mel spectrograms record how the energy of a vibration changes across frequencies over time, which helps capture the signal's general shape. However, spectrograms cannot be used in isolation, as they do not reveal which microphone detected the vibration first. GCC-PHAT, a technique that analyzes pairs of microphones to estimate the minute time delays between their signals, provides that timing information. GCC-PHAT is the primary method for predicting direction since it relies on phase differences rather than volume, making it dependable even in noisy or echo-prone environments.
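To make the timing idea concrete, here is a minimal NumPy sketch of GCC-PHAT for a single pair of signals. It is an illustration under stated assumptions rather than the researchers' implementation: the function name `gcc_phat`, the 16 kHz sample rate, and the synthetic 12-sample delay are all chosen for the example and do not come from the SonicBoom work.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate how many seconds `sig` lags `ref` using GCC-PHAT.

    Hypothetical helper for illustration only.
    """
    n = sig.shape[0] + ref.shape[0]        # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)             # cross-power spectrum
    cross /= np.abs(cross) + 1e-12         # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)          # generalized cross-correlation

    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that index `max_shift` corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs, cc

# Synthetic check: the second "microphone" hears the same noise burst
# a few samples later than the first (values assumed for the example).
fs = 16_000
true_delay = 12
rng = np.random.default_rng(0)
ref = rng.standard_normal(2048)
sig = np.concatenate((np.zeros(true_delay), ref[:-true_delay]))

tau, _ = gcc_phat(sig, ref, fs, max_tau=0.005)
print(f"estimated delay: {tau * fs:.0f} samples")
```

Because the cross-spectrum is normalized to unit magnitude before the inverse transform, only the phase (and therefore the timing) of each frequency bin contributes, which is why the estimate does not depend on how loud the signal is at either microphone.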

Robot proprioception, the robot's ability to recognize its own motion, is the third component. The robot knows its recent trajectory (position and velocity over the previous second), and it is most likely to collide with objects in the direction it is moving. SonicBoom learns to locate a contact on the robot by combining these three components: spectrograms to characterize the vibration, GCC-PHAT to quantify when each microphone sensed it, and proprioception to provide context.
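One simple way to picture the combination is as a single feature vector handed to a learned model. The sketch below only illustrates that idea: the article does not say how SonicBoom actually fuses its inputs, and every shape and name here (`fuse_features`, 161 lag bins, 64 Mel bands, a 50 Hz trajectory log) is a hypothetical placeholder. The only figure taken from the text is the microphone count, two rings of three.

```python
from itertools import combinations

import numpy as np

N_MICS = 6  # two rings of three contact microphones, per the article

def mic_pairs(n_mics=N_MICS):
    """All unordered microphone pairs for which GCC-PHAT could be computed."""
    return list(combinations(range(n_mics), 2))

def fuse_features(gcc_curves, mel_specs, trajectory):
    """Flatten and concatenate the three inputs into one vector.

    A real model might instead use a separate encoder per input;
    plain concatenation is an assumption made for this sketch.
    """
    return np.concatenate(
        [np.ravel(gcc_curves), np.ravel(mel_specs), np.ravel(trajectory)]
    )

# Placeholder inputs with assumed shapes (zeros stand in for real data).
pairs = mic_pairs()
gcc_curves = np.zeros((len(pairs), 161))   # one correlation curve per mic pair
mel_specs = np.zeros((N_MICS, 64, 32))     # (mic, Mel band, time frame)
trajectory = np.zeros((50, 6))             # 1 s at 50 Hz: xyz position + xyz velocity

features = fuse_features(gcc_curves, mel_specs, trajectory)
print(len(pairs), features.shape)
```

With six microphones there are 15 distinct pairs, so the timing input grows quickly with the number of sensors, while the proprioception input stays small by comparison.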

 

Summary

SonicBoom can determine the source of a collision or vibration from three pieces of information: the type of vibration, the direction the robot was traveling, and the time at which each microphone detected it. Combining these enables precise impact localization, despite the chaotic and erratic vibrations produced by a robot's own hardware. This technology could change how agri-robots sense their environment, much as a blindfolded person comes to rely more heavily on touch, sound, and smell. In short, the researchers use sound as an additional means of perception rather than relying solely on vision, which can fall short in unstructured environments.