Recognising Sounds – Knowing What we are Hearing

by mjkmercer

Note: see my essay ‘Sounds, Location and perception’, which is a prelude to this and covers the journey of a wave and its transformations from source to the ear.

Once the ear has received the complex array of waves (now collected at the ear canal as a pulsating tube of air) it falls to the rest of the auditory system to ‘work out’ the content and its meanings and to identify which components go together to form each discrete auditory stream. The apparatus now involves the brain and its experience of the world to carry out this almost instantaneous sifting: first to detect threat (which would make us jump before we knew why), then to look for messages with meaning for us that we need to respond to, and then, by focusing and deciding to do so, to hear all the background sounds. This is Auditory Scene Analysis.

The Oncoming Wave

As I sit writing this, I am aware of the following:

There is the slightest whisper of short strokes of my pen like silk on paper; the paper emits a tiny hollowness as the nib touches the surface and behaves like a membrane (I’m resting on a pad not a desk). There is also the soft breath of my computer fan – so continuous that I usually don’t hear it. Outside (I know it is outside because of my location perception apparatus) a bird unwinds its minimalist song. I cannot identify what sort of bird it is. If I could, the experience would be different. Its sound stream would be that of an ‘x’ and would form an impression complete with a label, and my familiarity might downgrade the event to the level of the computer fan. But being a musician I enjoy the tune, and it is something I have heard before and recognise. This is a special sort of sound stream in that it could be interrupted and I would still know what was coming next. Consider the inevitability of each note of a wood pigeon’s call. I’ll come back to this point about known sounds matched from knowledge templates versus unknown sounds, which become categorized by extrapolation from templates of similar sounds.


Theory of Forms and Sound Recognition Templates


I need to digress into Plato’s Theory of Forms briefly – so here is an early philosophy warning.

Sounds have form that exists in time and space. The sound of a car going by is such an event: it is a unique event yet totally familiar. Were we to record it and look at the waveform in great detail we would know, in keeping with our knowledge of things like snowflakes, that we will never see exactly the same waveform again, and yet there will be plenty of cars passing by. Each event has sufficient characteristics in common for us to be able to recognise the sound and label it, because we are able to access the idea of a car sound in our minds and instantaneously realise that it conforms to that general form.

The sound recognition templates (forms) that we have in our minds can only be broad in scope. We cannot hope to match a sound identically: it might appear at a different distance, be a different car, have a different surrounding acoustic, and we might have the window open or closed. So many factors guarantee that we will almost never hear the same sound twice.

We recognise the sound of a violin easily and, based on our wide experience of violins, we are able to get a close match to the templates we have. Were we to experience an er-hu (Chinese two-string violin) having never heard one before, our cognitive processes would go into overdrive looking for a match. The mind would offer the template of a violin to our understanding, but it would also inform us that it was not a veridical match: that something was different about the sound that defied the quick labelling. We might choose then to focus on the sound to describe to our inner process a wider experience of the sound. On being told what it is, we then understand what was heard and what the name is, and we file away a template for the future. (Imagine if we described an er-hu as being a little like a crude violin – how that might change our experience of ‘World Music’). However, this is only one instance of an er-hu so far. If, in future, we heard a very similar sound it might not match the template that was formed with one instance alone. Perhaps we hear the er-hu in a different context, say in a Chinese orchestra, and the mind struggles because it has a hunch about what it is hearing, and we might look at the programme and find the word. Then we can re-categorize the template with two instances attached to it. This is a gathering experience which, in time, will form a life of experience and cause us to say ‘yes, we know what an er-hu is,’ without us having to go through the room marked ‘violin samples.’

This is like Plato’s forms or, to be more up to date and put a similar spin on it (perhaps these different theories of forms aspire to an ideal theory of forms too?), Wittgenstein’s ‘family resemblances’. (In the next draft of this I’ll find the necessary quotes and references to keep us all connected to the world of ‘stuff already said by others’.)


Auditory Scene Analysis


I’m still in my study, listening… The bird continues to make its sound. But how do I know it is one bird and not two? It might well be two, of course – I cannot know each bird so intimately as to be able to distinguish individual sounds (though the bird probably can). But I can extrapolate from the incoming sound wave a single thread of sound. This is where I either recount everything Albert Bregman wrote or just send you off to read his work (see bibliography below), but to save time here is a rapidly and easily digested summary from Wiki….


What do we learn from this? That the auditory system has powerful tools to sort, label and understand streams even when they appear mixed. There is intelligence at work, and it may give you a small thrill of pleasure to know that even a powerful computer struggles to do this – but hold on to your humour because they are coming…


The Sum of all Sounds


More sounds are layered onto each other in my room: vehicles, voices, violinists (daughter at practice) and so on. All that I can hear is to some extent familiar but in another aspect unique. Similar though they may be, I have never heard them in this context (this mix of other sounds), at this distance, with this room reverberation, or with this physical make-up (my ears are particularly acute today), and so on.


To sum up the listening experience of this moment:

1          The low hum of a large vehicle

2          The bird sound

3          The sound of the pen on paper

4          The fan in the computer

5          Murmuring female voices somewhere – indiscernible words

6          Slight creak from my old faithful chair

7          Distant violin music being played

8          Clock softly ‘chucking’ (It’s a deeper sound than a tick)

…I shall pause and sit still in contemplative listening for a moment….


9          A very distant jet somewhere high above

10        A dog barked.


In the moment of the dog barking, all the other sounds became masked, but I know they had not stopped, and sure enough they re-appeared a moment later when I widened my attention again. Masking, and the illusion of continuity behind the masking, is another piece of lovely hearing perception theory; I am very keen on Brian Moore’s book (see Bibliography).


My point – all this is happening early on a quiet morning, before the day has really got going. All these sounds are there to be picked out of the incoming stream and labelled. And I know what each one was. Being of a musical disposition, if I heard something I could not label, I would normally feel compelled to investigate – especially if I could hear musical potential.


So all these sounds are conforming to the templates I have for them and they have sorted themselves out into labelled streams. I do not confuse the hum of the lorry with the distant music, though I could imagine that that might happen in some circumstances. As a composer I could make all sorts of things come together in the studio through careful mixing, but at the moment the sounds are from different directions and different distances, and they are therefore not confused.

This single ‘mix’ of the moment ‘now’ (surely grounds for a Cagean aleatory composition?) presents a single vibrating pipe of air to the eardrum, and the mind sorts it out for us and labels each experience with a word or familiar feelings. It also still works to identify the location of the source and so on. The interesting thing is that I seem only to be able to give my attention to one at a time, or I can choose to let the whole unlabelled sound wash over me as if it were a single source. This ability is important because of the implications for music.


Musical Implications


My ears have to work out how to hear an orchestra. I can, of course, hear the different parts that make up the sound I hear. But is that really so?  What I actually hear is groups. I cannot, for example, discern individual violins (unless one is particularly loud or bad) but I can pick out the flute. But then the flute is joined in unison by the oboe forming a new single sounding texture. The two combine because they are part of the same auditory scene, their timing events are identical and their pitches change in unison. A good example of this blending of different sounds is the way in which organists create sounds by mixing and layering different sets of pipes in ‘registrations’.  We hear a single event or single event texture. We hear a single violin line – unless suddenly the leader plays a melody above the texture.

This auditory scene analysis is critical to our understanding of how we hear music and how therefore we are going to record it.


When we listen to music we hear the whole thing or we choose to attend to parts. (Listening to a fugue on the piano is a supreme example of this and the best advice anybody ever gave me for listening to a fugue is to ‘go with the flow’).


The factors that separate sounds out for us, so that we do not wrongly mix them up, are:


  • Pitch differences and similarities
  • Timing differences and similarities
  • Following or not following a pattern (a sequence of timed events, like a ticking clock)
  • Timbre
  • Location or apparent source
  • Event (part of the show/not part of the show)
  • Visual information about the sources


These factors help us to match different parts of the sound to templates to be able to recognise each as a separate element.
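To make just one of these cues concrete, here is a toy sketch of grouping by timing alone: sounds whose onsets nearly coincide are lumped into one stream. It is an illustration only, not a model of hearing. The function name `group_by_onset`, the 30 ms tolerance and the example onset times are all assumptions invented for the example.

```python
def group_by_onset(events, tolerance_s=0.03):
    """Group (label, onset_seconds) pairs whose onsets nearly coincide.

    tolerance_s is an arbitrary illustrative value, not a measured
    perceptual threshold.
    """
    groups = []
    for label, onset in sorted(events, key=lambda e: e[1]):
        # Join the current group if this onset is close to the previous one.
        if groups and onset - groups[-1][-1][1] <= tolerance_s:
            groups[-1].append((label, onset))
        else:
            groups.append([(label, onset)])
    return [[label for label, _ in group] for group in groups]


# Hypothetical onset times, in seconds, for a handful of sounds.
events = [("flute", 0.00), ("oboe", 0.01), ("clock", 0.50),
          ("dog", 1.20), ("chair", 1.21)]
print(group_by_onset(events))
```

With these made-up numbers the flute and oboe fuse, as they should, but so do the dog and the chair, which merely happened to start together. Timing on its own is not enough, which is why the other factors in the list (pitch, timbre, location and so on) are needed as well.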

Sorting out Source Location

I wrote previously about how a sound accrues information as it travels to the ear concerning the source location in relation to the listener. The listener also has to process that data at the same time as carrying out the scene analysis.

In the same way that we have templates to recognise sound types and patterns, I suggest that we have a template or model of the world that helps us sort out source locations. I am not sure if I should try to confound things by suggesting that each sound template that guides our recognition holds all the possible variants of location within – that seems an inefficient way for the brain to do things. I am going to suggest (and try to research further) that localisation (spatial) processing is from a different set of templates to those we use to identify the sound itself. Some of those templates for location might be linked to the fight or flight sound identifiers and cause rapid alarm in us primitive beings who still want to jump and run if we hear something like a wasp approaching.


As I mentioned earlier, at the eardrum there is only a vibrating column of air. (I will keep it simple, but I am aware that there is other information available through the vibrations coming to the back of the eardrum through the head and through information picked up through bone vibration etc.) The intelligent ear has a means of assessing how far away a sound is. Sound changes with distance: it changes in level, reverberation and tone. When we hear a sound thus modified we know immediately that we are hearing, say, a trumpet at a distance rather than one close to and processed. (This we can register from a microphone and thus filter out, for now, extraneous information about its general direction.)
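The level part of that change can be put into rough numbers. The sketch below assumes an idealised point source in a free field (no reflections), where the direct sound follows the inverse-square law and so falls by about 6 dB for each doubling of distance. The function name `level_drop_db` and the 1 m reference are assumptions chosen for illustration. Real rooms behave differently, because the reverberant part of the sound does not fall away in the same manner.

```python
import math


def level_drop_db(distance_m, reference_m=1.0):
    """Drop in direct-sound level (dB) relative to the level at reference_m.

    Assumes an idealised point source in a free field (inverse-square law).
    """
    if distance_m <= 0 or reference_m <= 0:
        raise ValueError("distances must be positive")
    return 20.0 * math.log10(distance_m / reference_m)


for d in (1, 2, 4, 8, 16):
    print(f"{d:>2} m: {level_drop_db(d):5.1f} dB below the 1 m level")
```

Level alone is ambiguous (a quiet trumpet nearby and a loud one far away can arrive at the same level), which is why reverberation and tone matter as well.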


But what if we were to do just that – take a close-up trumpet sound and drop it to the rear of a mix? This happens all the time in the studio, and good engineers know it is not just volume and reverb but careful adjustment of the EQ that gives the desired result. The level of verisimilitude seems closely linked to the engineer’s understanding of sound propagation. In my experience, engineers brought up in the ‘hands-on’ school of mixing (or ‘do it the same way as Bob’, as apprentices soon find out) are lost when it comes to working in these more subtle ways. The rock techniques for moving a sound to the rear will not, however, work very well when trying to create a realistic soundstage that contains reproducible distance information in particular and location information in general.
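As a very crude illustration of the three controls involved (level, EQ and reverb), the sketch below takes a dry signal and makes it quieter, duller and more reverberant. Everything in it is an assumption made for the example: the function name `push_back`, the parameter values, the one-pole low-pass filter standing in for EQ and the single feedback comb filter standing in for a reverb. It is not a description of how any engineer actually works, and a real mix would use far better tools.

```python
import math


def push_back(dry, sample_rate=44100, gain_db=-12.0, cutoff_hz=3000.0,
              delay_ms=40.0, feedback=0.5, wet=0.4):
    """Crude 'further away' treatment: quieter, duller, more reverberant."""
    gain = 10.0 ** (gain_db / 20.0)

    # One-pole low-pass: a stand-in for the duller tone of a distant source.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    filtered, state = [], 0.0
    for x in dry:
        state += alpha * (x * gain - state)
        filtered.append(state)

    # Feedback comb filter: a very rough stand-in for room reverberation.
    delay = max(1, int(sample_rate * delay_ms / 1000.0))
    comb = [0.0] * len(filtered)
    for n, x in enumerate(filtered):
        delayed = comb[n - delay] if n >= delay else 0.0
        comb[n] = x + feedback * delayed

    # Blend the filtered signal with its comb-filtered copy.
    return [(1.0 - wet) * f + wet * c for f, c in zip(filtered, comb)]


# A tenth of a second of a 440 Hz tone as a stand-in for the close-up trumpet.
tone = [math.sin(2 * math.pi * 440 * n / 44100) for n in range(4410)]
distant = push_back(tone)
print(max(abs(v) for v in tone), max(abs(v) for v in distant))
```

Turning only `gain_db` down is the ‘volume alone’ approach. The point of having the three separate knobs is that a convincing sense of distance depends on their balance, not on any one of them.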


So we have in us an innate ability to assess how far away a sound is. This must be informed by knowledge about where we are. We will know that we are in a cathedral or outdoors, for example. Experience tells us something about what to expect and how a sound will behave in these environments.


Recreating The Experience


The problem for recording engineers is to recreate what the ear has heard convincingly. There is not space here for a general review of stereo and multi-channel techniques (I’ll write it soon though).

There is much that can be improved in the stereo recording and mixing process by understanding how sound gets to the sentient mind, and much that we can design as a solution to improve that. It is vital to understand that:


At the point where the microphone receives the sound, most of the location information will not be recorded. The microphone will give some distance information and an approximate direction, but it will not pick up what the eardrum receives. Were we to insert microphone capsules in the ear and record the sound that gets to the drum we might have more information to work with, but because of head-related transfer functions the sound will present uniquely to the individual whose head is being measured. It was in the hope of getting round all these problems that binaural recording was invented – placing microphones in a dummy head to mimic the way our own heads work.


More soon.


MjkM August 2013