The Search - SPORCH 0.2.2-alpha

2.2 The Search

Once the database is built, the user must provide an orchestra definition file to tell SPORCH which combinations of instruments it can use. Any sound recording can then be analyzed and orchestrated by executing SPORCH's search engine and specifying the source sound file. When finished, the program outputs the results of its analysis as either a text printout or a C data structure, depending on how the program is used (as the sporch command-line program or by loading libsporch.so and calling its API directly). The result may then be resynthesized for verification or passed to other functions for refinement and output to a notation program.

The orchestration algorithm works as follows:

The sound source file is read and analyzed for prominent peaks in its frequency spectrum (via the procedures described above).
The program iterates through every possible instrument, technique, pitch, and dynamic level and evaluates these to determine which one is the “best fit” for the given sound source. The best fit is determined using a procedure that compares the source peak data to the instrument peak data and rates the match with a numerical value. The rating represents to what degree the source spectrum can be eliminated by “subtracting” the instrument peaks from the source peaks. The following steps describe this in more detail:
- The program iterates through every peak for the instrument/pitch/dynamic level currently being evaluated.
- An attempt is made to match each peak with a corresponding peak in the source sound. A match is made if the peaks occur within a certain error margin or frequency range. The error margin is half of a semitone if chords based on the semitone scale are specified and half of a quartertone if chords based on the quartertone chromatic scale are specified. (Other tunings may be defined, in which case similar error margins are used.)
- If the two peaks match, the amplitude value of the instrument peak is subtracted from the amplitude of the matching source peak. Peaks with negative amplitudes may result, which contribute by increasing the amount of error.
- If there isn't a match, a new “peak” is created in the source sound data with the same frequency as the instrument peak. The amplitude of the new peak is the negative value of the instrument peak's amplitude. (In other words, a new “negative” peak is created.)
- After iterating through each instrument peak, the sum of the squared amplitudes of all peaks in the source is calculated, giving a value that roughly represents the total amplitude of the source's spectrum. Squaring the amplitudes gives more importance to higher values. Both positive and negative peak amplitudes contribute to this sum, so negative peaks effectively increase the final score.
- A lower value resulting from this calculation indicates a better fit. If the two spectra are nearly equivalent, the result is close to zero. If they are completely different, the result is a larger number. This number might be even greater than the original starting point if nothing at all was subtracted from the source data (and only negative peaks were created), in which case the particular instrument, pitch and dynamic level combination is completely discarded as a candidate and the program continues to search through other possibilities.
- If no instruments are able to decrease the spectrum by any amount, the program ends its search and returns the results of its analysis.
- If the above is not the case, the program adds the best fit to its list and replaces the source spectrum with the “subtracted” one used in the above evaluation. All subsequent evaluations then use this data to determine the next best fit. The iteration continues until either no instrument is found that decreases the source data or all of the instrument combinations have been used up.
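
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not SPORCH's actual implementation or the libsporch API: the function names (`subtract`, `score`, `orchestrate`), the representation of a spectrum as a list of (frequency, amplitude) pairs, the choice of the nearest source peak when several fall inside the margin, and the measurement of the margin in cents are all assumptions made for the example.

```python
import math

def cents(f1, f2):
    """Size of the interval between two frequencies in cents (1200 per octave)."""
    return abs(1200.0 * math.log2(f1 / f2))

def subtract(source, instrument, margin_cents=50.0):
    """Return a copy of `source` with the instrument peaks subtracted.

    Both arguments are lists of (frequency_hz, amplitude) pairs.
    """
    residual = [list(p) for p in source]
    for freq, amp in instrument:
        # Assumption: match the nearest source peak inside the error margin.
        best = None
        for peak in residual:
            d = cents(freq, peak[0])
            if d <= margin_cents and (best is None or d < best[0]):
                best = (d, peak)
        if best is not None:
            best[1][1] -= amp              # the result may become negative
        else:
            residual.append([freq, -amp])  # no match: new "negative" peak
    return [tuple(p) for p in residual]

def score(peaks):
    """Sum of squared amplitudes; lower means a better fit."""
    return sum(a * a for _, a in peaks)

def orchestrate(source, candidates, margin_cents=50.0):
    """Greedy search. `candidates` maps a label to its list of peaks."""
    remaining = dict(candidates)
    current = list(source)
    current_score = score(current)
    picks = []
    while remaining:
        best_label, best_residual, best_score = None, None, current_score
        for label, peaks in remaining.items():
            residual = subtract(current, peaks, margin_cents)
            s = score(residual)
            if s < best_score:             # keep only candidates that reduce the score
                best_label, best_residual, best_score = label, residual, s
        if best_label is None:             # nothing decreases the spectrum: stop
            break
        picks.append(best_label)
        current, current_score = best_residual, best_score
        del remaining[best_label]          # each combination is used at most once
    return picks, current_score
```

With a hypothetical source of three peaks at 440, 660 and 880 Hz and three made-up candidates (a flute note covering 440 and 880 Hz, an oboe note covering 660 Hz, and a tuba note far below all of them), the sketch selects the flute first, then the oboe, and discards the tuba because it could only add a negative peak.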

The algorithm essentially finds instrument/pitch/dynamic level tuples whose frequency contents add together to form a composite spectrum that has some crude resemblance to that of the source. The amount of similarity varies depending on the source used and the instrumentation specified. In general, sound sources with a strong pitched element produce orchestrations that sound relatively close to the source with respect to pitch and timbre. Sound sources that contain noise are nevertheless also useful when matched with pitched instruments: the algorithm attempts to approximate the noise by selecting a somewhat random but biased collection of notes.

When developing the application, several different methods of peak matching (or different types of error margins) were tried. The ones originally expected to give the best results were based either on the frequency discrimination curve or critical bandwidth curves—in other words, the error margin changed depending on the frequency values under consideration. The results were much worse in quality than when the static error margin described above was used, the most obvious discrepancy being a difference in perceived pitch. Since the results were chords made of pitches quantized to either 12 or 24 equal divisions of the octave, any difference of more than a semitone or quartertone between the most prominent pitch components was heard simply as a different pitch. The conclusion was that when forming chords based on semitone divisions of the octave, the error margin must be half of a semitone distance (half of the distance between neighboring pitches). For quartertone scales the distance is half of a quartertone.
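
As a small illustration of the static error margin, the following sketch computes half of a scale step for a given number of equal divisions of the octave and tests whether two peak frequencies fall within it. Measuring the interval in cents on a logarithmic frequency scale is an assumption of this example, as are the function names; the text above only states that the margin is half of a semitone or half of a quartertone.

```python
import math

def error_margin_cents(divisions_per_octave):
    """Half the distance between neighboring scale steps, in cents."""
    step = 1200.0 / divisions_per_octave
    return step / 2.0

def peaks_match(f_instrument, f_source, divisions_per_octave=12):
    """True if two peak frequencies lie within the static error margin."""
    interval = abs(1200.0 * math.log2(f_source / f_instrument))
    return interval <= error_margin_cents(divisions_per_octave)
```

Under these assumptions the margin is 50 cents for a semitone scale and 25 cents for a quartertone scale, so a source peak at 450 Hz (roughly 39 cents above 440 Hz) would match an instrument peak at 440 Hz in the first case but not in the second.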

The procedure outlined above is executed when the software is run at its lowest “search depth” setting. When set to a higher level, the algorithm may split the search at any point, considering the best two or three instrument choices rather than just selecting one. Multiple search paths are opened up only if their scores are close to each other (within some threshold level). This heuristic causes the algorithm to widen the search only when choices are relatively close. The higher the search depth setting, the more exhaustive the search, with the highest setting being a complete search through every possible combination.
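
The branching idea can be sketched generically as follows. This is a hypothetical illustration rather than SPORCH's code: the function `branching_search`, its parameters `max_branch` and `threshold`, and the choice of a threshold relative to the current score are assumptions. The caller supplies `apply(state, option)`, which returns the state after subtracting an option, and `score(state)`, where lower is better.

```python
def branching_search(state, options, apply, score, max_branch=2, threshold=0.05):
    """Return (final_score, picks) for the best path found from `state`.

    With max_branch=1 this reduces to the plain greedy search.
    """
    base = score(state)
    # Rank all options by the score they would leave behind.
    ranked = sorted(
        ((score(apply(state, o)), o) for o in options),
        key=lambda pair: pair[0],
    )
    # Discard options that do not decrease the score at all.
    ranked = [(s, o) for s, o in ranked if s < base]
    if not ranked:
        return base, []
    best_score = ranked[0][0]
    # Open extra paths only for runners-up that are close to the best choice.
    branches = [(s, o) for s, o in ranked[:max_branch]
                if s - best_score <= threshold * base]
    result = None
    for _, option in branches:
        rest = [x for x in options if x != option]
        final, picks = branching_search(
            apply(state, option), rest, apply, score, max_branch, threshold)
        if result is None or final < result[0]:
            result = (final, [option] + picks)
    return result
```

In a toy case where the second-best first choice leads to a better overall result, running with `max_branch=1` follows the greedy path, while `max_branch=2` also explores the close runner-up and returns the lower final score.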

SPORCH also assigns a numerical value to each of its matches, which may be interpreted as a confidence value or as a rating of the instrument's contribution to the total match. This value is the relative amount that the instrument/pitch has subtracted from the original, measured with the same scoring procedure described above. It is somewhat related to the dynamic level chosen and is useful for determining which instrument/pitch tuples are the most important. A single, final confidence value is also output, showing the total percentage of the source that was subtracted. Informal listening tests have shown that this value is useful for estimating whether the result is similar with respect to timbre, but not necessarily with respect to pitch.
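
One plausible way to compute such values is sketched below. The exact formula SPORCH uses is not given here, so this is an assumption: each pick's rating is taken as the drop in score it caused, divided by the source's initial score, and the final confidence is the total drop divided by the same initial score. The function name and the `scores` list are hypothetical.

```python
def contribution_ratings(scores):
    """Compute per-pick ratings and a final confidence value.

    `scores[0]` is the initial score of the source spectrum and `scores[i]`
    is the score remaining after the i-th pick has been subtracted.
    """
    original = scores[0]
    if original <= 0:
        raise ValueError("the source spectrum has no energy to subtract")
    # Relative amount each pick removed from the original score.
    ratings = [(before - after) / original
               for before, after in zip(scores, scores[1:])]
    # Total fraction of the original score that was removed.
    total = (original - scores[-1]) / original
    return ratings, total
```

For a hypothetical score sequence of 1.89, 0.66 and 0.03, the first pick would be rated at about 0.65, the second at about 0.33, and the final confidence would be about 98%.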

Although the spectrum of the orchestration typically contains many more prominent frequency components than that of the source, most of the original partials are present. The energy of the sound in both cases is also usually concentrated in the same frequency areas. When comparing the two aurally, the timbre of the orchestrated sound is very similar to the timbre of the original, even though the textures of the two sounds are most likely completely different. The subtraction procedure thus accomplishes at least two things that are significant in matching the timbre of the original sound:

- Either the highest peaks or sometimes groups of three or more peaks that fall into harmonic intervals are matched first. This is important for matching the pitch of the sound: the fundamentals and lower harmonics of each instrument/pitch tend to add together to recreate the perceived pitch of the sound. The fundamentals (or octave transpositions in some cases) of the instruments either match the highest peaks or create a few new ones that may be implied (as virtual harmonics) but aren't present in the original sound. These initial picks are usually accompanied by high dynamic markings and high contribution ratings.
- The remaining picks usually have low dynamic levels and contribute by shaping out the rest of the spectral envelope, filling in the places where more energy is needed. These notes are often significantly higher or lower than the initial picks, depending on the nature of the source sound. The final confidence value tends to reflect how well these later choices contribute to the overall timbre.