Musical notes recognition using artificial neural networks.
Artificial neural networks have gone through periods of intense activity and periods of disappointing results. The first decade of the 21st century appears to be a period in which research focuses more on practical applications in very diverse areas. From Hermann von Helmholtz in 1869 and Pavlov (Pavlov, 1927), who developed theories of learning, through Hebb (Hebb, 1961), who formulated the principle of synaptic plasticity, to Kohonen (Kohonen, 1995) and Hopfield (Hopfield, 1982), who developed new structures and training methods, every period has seen practical applications of artificial neural networks. In the last decade, the main domains in which artificial neural networks have proved their utility and efficiency are function approximation, data classification, pattern recognition, shape recognition, voice identification, industrial process control, robotics, and financial prediction. In (Yadid-Pecht et al., 1996) musical notes are recognized using a modified Neocognitron model, while the method described here uses feed-forward neural networks. This paper belongs to the area of pattern recognition and automatic image-to-sound conversion. The scales considered are relatively simple, and the notes taken into consideration are whole, half, quarter and eighth notes.
2. NOTE FEATURES
The authors suggest the following phases to solve the proposed problem: acquiring an image using a web camera; identifying the stave lines of the current scale and deleting them from the image; identifying the properties of each note and erasing the current note; identifying the notes by using the procedural method; exporting the characteristics to the neural network; training the net using the training set; testing the network; displaying the notes and playing the scale. Finding the characteristics is the main processing step; its objective is to identify the properties of each musical note. This algorithm includes image pre-processing and extraction of the note properties. The step is critical because it has no redundancy: if the extraction of the note properties fails, the whole program is affected.
Input data. The program accepts as input an image that is processed to obtain the characteristics. The image may contain noise pixels, stems and flags may not be "standard", stave lines may not be exactly parallel, and the note is green. The image must have the following properties: it contains a five-line stave; there are no overlapping notes (that is, no two or three voices); and the distance between two consecutive notes is at least the width of one note (this value can be set in the program).
Image conversion to binary (bitmap). The first step after acquiring the image is to obtain a black-and-white version of it. The conversion works as follows: read each pixel of the image by rows and columns and identify its color levels (red - R, green - G, blue - B). Then convert the color to grayscale using the formula for the effective luminance of a pixel (Moise, 2005):
Y = 0.3 × R + 0.59 × G + 0.11 × B. (1)
Then, the grayscale image is converted to binary using a threshold procedure. After this step, the scale is converted to black and white.
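The two conversion steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `to_gray` and `to_binary` are hypothetical, the threshold of 128 is an assumption (the text does not give a value), and the image is assumed to be a list of rows of (R, G, B) tuples with 8-bit channels. Following the convention used later in the text, 1 stands for a black pixel.

```python
def to_gray(r, g, b):
    """Luminance of one pixel, following equation (1)."""
    return 0.3 * r + 0.59 * g + 0.11 * b


def to_binary(rgb_image, threshold=128):
    """Convert rows of (R, G, B) tuples to rows of 0/1 values.

    1 marks a black (ink) pixel and 0 a white (paper) pixel.
    The default threshold of 128 is an assumption, not a value from the text.
    """
    return [
        [1 if to_gray(r, g, b) < threshold else 0 for (r, g, b) in row]
        for row in rgb_image
    ]


# A 2 x 2 test image: white, black, dark green, near-white.
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(0, 128, 0), (250, 250, 250)],
]
binary = to_binary(image)
print(binary)  # [[0, 1], [1, 0]]
```

With this weighting a dark green pixel has a luminance of about 75 and therefore falls on the black side of the assumed threshold.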
Noise rejection. Captured images contain noise pixels or groups of pixels caused by uneven lighting of the scale or by incorrect conversion. The noise rejection function eliminates every black pixel whose 4-connected and 8-connected neighbors are all white.
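One possible reading of this rule is sketched below: a black pixel is cleared when none of its eight neighbors is black. The function name `remove_isolated_pixels` is hypothetical, and treating pixels outside the image as white is an assumption. The sketch removes single-pixel noise only; larger noise groups would need a further step that the text does not detail.

```python
def remove_isolated_pixels(binary):
    """Clear every black pixel (1) whose 8 neighbours are all white (0).

    Pixels outside the image are treated as white (an assumption).
    """
    rows, cols = len(binary), len(binary[0])
    cleaned = [row[:] for row in binary]  # work on a copy
    for y in range(rows):
        for x in range(cols):
            if binary[y][x] != 1:
                continue
            has_black_neighbour = any(
                binary[ny][nx] == 1
                for ny in range(max(0, y - 1), min(rows, y + 2))
                for nx in range(max(0, x - 1), min(cols, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbour:
                cleaned[y][x] = 0
    return cleaned


noisy = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # isolated pixel: removed
    [0, 0, 0, 1, 1],  # two connected pixels: kept
    [0, 0, 0, 0, 0],
]
cleaned = remove_isolated_pixels(noisy)
print(cleaned[1])  # [0, 0, 0, 0, 0]
print(cleaned[2])  # [0, 0, 0, 1, 1]
```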
Stave lines. After noise rejection, the next step is to enclose the image in an area bordered by two vertical lines, so that all stave lines have the same length once the procedure is applied. The authors call this process identifying the start and stop points. The algorithm is the following: read the image columns upwards, starting from the lower left corner. When five consecutive transitions from 1 to 0 and five consecutive transitions from 0 to 1 are found in a column, that column is taken as the left-hand sideline. Similarly, to get the right-hand sideline, the columns are read starting from the lower right corner. After applying these procedures, the original image is bordered by the two lines just found, as in Fig. 1.
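The start and stop search can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' code: the names `count_black_runs` and `stave_bounds` are hypothetical, a column is accepted when it contains exactly five black runs (each run produces one 0-to-1 and one 1-to-0 transition), and row 0 is taken to be the top of the image.

```python
def count_black_runs(column):
    """Count runs of consecutive black pixels (1) in a sequence of pixels."""
    runs, previous = 0, 0
    for pixel in column:
        if pixel == 1 and previous == 0:  # a 0 -> 1 transition opens a run
            runs += 1
        previous = pixel
    return runs


def stave_bounds(binary, lines=5):
    """Return (left, right) column indices where the stave starts and stops.

    Each column is read bottom-up; the first column with exactly `lines`
    black runs, scanning from either side, is taken as a boundary.
    Returns None for a side where no such column exists.
    """
    rows, cols = len(binary), len(binary[0])

    def has_stave(x):
        column = [binary[y][x] for y in range(rows - 1, -1, -1)]
        return count_black_runs(column) == lines

    left = next((x for x in range(cols) if has_stave(x)), None)
    right = next((x for x in range(cols - 1, -1, -1) if has_stave(x)), None)
    return left, right


# 11 rows x 6 columns: stave lines on odd rows, spanning columns 1..4.
stave = [[1 if (y % 2 == 1 and 1 <= x <= 4) else 0 for x in range(6)]
         for y in range(11)]
print(stave_bounds(stave))  # (1, 4)
```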
Identifying characteristics. To identify a note, one must find the characteristics that define it. The first is the stave line to which the note is related. Two kinds of relationship can exist between a note and a line: the note is on line n or under line n. The note head gives another characteristic: full head or empty head. The flag gives the third characteristic: note with or without a flag.
[FIGURE 1 OMITTED]
In order to find the line interacting with the note, one should identify the following points: the left end, the bottom end, the right end, and the upper end of the note. To find the left end of a note, the image is read by columns, starting from the lower left corner, until a 1 (black) pixel is found. Because several pixels (a segment) may lie on the left end of the note, the pixel in the middle of the segment is kept. To find the bottom end of a note, the image is read starting from the left end to the right, row by row. The reading ends when the y coordinate of a pixel is smaller than the previous y coordinate. The relative center of the note is found by the following reasoning. Given the left end (x_s, y_s) and the bottom end (x_j, y_j), the center of the note has the coordinates (x_j, y_s). The center of the note tells whether the note head is full or empty: if the central pixel has the value 1, the note has a full head; otherwise the head is empty. The upper end and the right end of the note can be found by reading the image starting from the center of the note in a vertical direction (for the upper end) and in a horizontal direction (for the right end).
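The center rule can be tried on a small example. The sketch below assumes the stave lines have already been removed and that the image is cropped to a single note head. All function names are hypothetical, and coordinates are (x, y) with row 0 at the top. The bottom-end search is deliberately simplified to "middle of the lowest black row", which stands in for the row-by-row reading described above.

```python
def find_left_end(binary):
    """Middle pixel of the black segment in the leftmost non-empty column."""
    rows, cols = len(binary), len(binary[0])
    for x in range(cols):
        ys = [y for y in range(rows) if binary[y][x] == 1]
        if ys:
            return (x, ys[len(ys) // 2])
    return None


def find_bottom_end(binary):
    """Middle pixel of the lowest black row (a simplified stand-in)."""
    for y in range(len(binary) - 1, -1, -1):
        xs = [x for x, pixel in enumerate(binary[y]) if pixel == 1]
        if xs:
            return (xs[len(xs) // 2], y)
    return None


def note_center(left_end, bottom_end):
    """Center (x_j, y_s) from left end (x_s, y_s) and bottom end (x_j, y_j)."""
    _, y_s = left_end
    x_j, _ = bottom_end
    return (x_j, y_s)


def head_is_full(binary, center):
    """A full head has a black pixel (1) at its center."""
    x, y = center
    return binary[y][x] == 1


hollow = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
]
center = note_center(find_left_end(hollow), find_bottom_end(hollow))
print(center, head_is_full(hollow, center))  # (2, 2) False
```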
The flag and the stem. To find the length of a note (stem and flag), the image is read according to the representation in Fig. 2. During this reading, positive edges (changes from white to black) and negative edges (changes from black to white) are found. If, at the end of the reading, the maximum number of positive edges followed by negative edges equals 2, the note has a flag; if it equals 1, the note has only a stem.
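The edge-counting rule might look as follows in code. Because Fig. 2 is not available, the scan direction is an assumption: the sketch reads the region above the note head row by row, so that a stem alone crosses each row once and a stem with a flag crosses some row twice. The names `count_edge_pairs` and `classify_stem` are hypothetical.

```python
def count_edge_pairs(line):
    """Count positive edges (0 -> 1) that are closed by a negative edge (1 -> 0).

    A trailing white pixel is appended so that a black run touching the
    border is still closed (an assumption).
    """
    pairs, previous = 0, 0
    for pixel in list(line) + [0]:
        if previous == 1 and pixel == 0:  # negative edge closes a run
            pairs += 1
        previous = pixel
    return pairs


def classify_stem(region):
    """Return 'flag', 'stem' or 'unknown' from the maximum edge-pair count."""
    most = max(count_edge_pairs(row) for row in region)
    if most == 2:
        return "flag"
    if most == 1:
        return "stem"
    return "unknown"


stem_only = [
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
stem_and_flag = [
    [0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0],  # this row crosses the stem and the flag
    [0, 1, 0, 0, 1],
    [0, 1, 0, 0, 0],
]
print(classify_stem(stem_only))      # stem
print(classify_stem(stem_and_flag))  # flag
```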
The characteristics are converted into binary as follows. The first 7 bits represent the number of the line that interacts with the note. For example, if the note interacts with the line 3, then the binary number will be 0001000. Bit 8 represents the position of the note relative to the interacting line: it is 1 if the note is on the line and 0 if the note is under the line. Bit 9 represents the head of the note: it is 1 if the note has a full head and 0 if it has an empty head. Bit 10 represents the stem: it is 1 if the note has a stem and 0 if it does not. Bit 11 represents the flag: it is 1 if the note has a flag and 0 if it does not. The characteristics for the note in the example above are 00010001111.
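The encoding can be written as a short helper. In this sketch the function name `encode_features` is hypothetical, and the first argument is the 0-based position of the set bit within the 7-bit line field, not a line number, because the text does not spell out how line numbers map to bit positions.

```python
def encode_features(bit_position, on_line, full_head, has_stem, has_flag):
    """Build the 11-bit feature string: 7 line bits, then 4 property bits.

    bit_position is the 0-based index of the set bit in the line field
    (an assumption about the mapping from line number to bit).
    """
    line_bits = ["0"] * 7
    line_bits[bit_position] = "1"
    properties = [on_line, full_head, has_stem, has_flag]
    return "".join(line_bits) + "".join("1" if p else "0" for p in properties)


# Reproduces the example string from the text.
code = encode_features(3, on_line=True, full_head=True,
                       has_stem=True, has_flag=True)
print(code)  # 00010001111
```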
3. NETWORK DESIGN
The 11 features described above are the inputs of a fully connected neural network with 11 neurons in the input layer, a hidden layer with 100 neurons, and 2 neurons in the output layer. The activation function of the output-layer neurons was linear. The backpropagation algorithm (Chauvin & Rumelhart, 1995) was used to train the network; a fragment of the training set is shown in Tab. 1. For example, 1000000 represents the note DO. Le is the note length, L1, ..., L7 are the lines of the stave, Full means a full head, and Stem indicates the existence of a stem.
Output t1 indicates the note number and output t2 gives the note length (a real number: 1 for a whole note, 0.5 for a half note, 0.25 for a quarter note, or 0.125 for an eighth note). The training algorithm was used to train the network with different activation functions.
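To make the 11-100-2 layout concrete, here is a sketch of a single forward pass in plain Python. It is not the authors' code: the weights are random and untrained, so the outputs carry no meaning, and the hyperbolic tangent in the hidden layer is chosen only for illustration. The names `make_layer` and `forward` are hypothetical.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(x, hidden, output):
    """11 inputs -> 100 tanh hidden neurons -> 2 linear outputs (t1, t2)."""
    w_h, b_h = hidden
    w_o, b_o = output
    h = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
         for row, b in zip(w_h, b_h)]
    return [sum(w * v for w, v in zip(row, h)) + b
            for row, b in zip(w_o, b_o)]


rng = random.Random(0)             # fixed seed, for repeatability only
hidden = make_layer(11, 100, rng)
output = make_layer(100, 2, rng)

features = [int(c) for c in "00010001111"]  # the 11-bit example from the text
t1, t2 = forward(features, hidden, output)
print(len(features), len(hidden[0]), len(output[0]))  # 11 100 2
```

Training with backpropagation would then adjust `hidden` and `output` until t1 and t2 approach the targets listed in the training set.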
[FIGURE 2 OMITTED]
Twelve different activation functions were tried for the hidden-layer neurons, and only six completed the training process: sigmoid, radial basis, hyperbolic tangent, triangular basis, linear saturation, and symmetric linear saturation. The triangular basis function was the fastest and the sigmoid was the slowest. After comparing the results obtained with different activation functions, different training algorithms were taken into consideration. The fastest algorithm was traingdx (gradient descent with momentum and variable learning rate) and the slowest was gradient descent with variable learning rate.
Playing a scale. The method described above was used to play two scales, with 4 and 3 notes. The results obtained after training the network are shown in Tab. 2. The errors are on the order of 0.01, so each output lies close to its target note number and note length.
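The real-valued outputs in Tab. 2 can be mapped back to a note name and a length by rounding. The sketch below is one plausible way to do this, not a description of the authors' program: the name `decode` is hypothetical, and the names for note numbers 4, 6 and 7 (fa, la, si) are assumed from the usual solfège order, since the text only confirms do, re, mi and sol.

```python
# Note numbers 1, 2, 3 and 5 follow the text; 4, 6 and 7 are assumed.
NOTE_NAMES = {1: "do", 2: "re", 3: "mi", 4: "fa", 5: "sol", 6: "la", 7: "si"}
LENGTHS = {1.0: "whole", 0.5: "half", 0.25: "quarter", 0.125: "eighth"}


def decode(t1, t2):
    """Map network outputs to (note name, length name).

    t1 is rounded to the nearest integer; t2 is matched to the closest
    of the four target lengths.
    """
    note = NOTE_NAMES.get(round(t1), "unknown")
    nearest = min(LENGTHS, key=lambda value: abs(value - t2))
    return note, LENGTHS[nearest]


# Values taken from Tab. 2.
print(decode(4.99868, 0.119873))  # ('sol', 'eighth')
print(decode(2.00323, 0.229565))  # ('re', 'quarter')
print(decode(2.99876, 0.494526))  # ('mi', 'half')
```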
The authors developed an application for recognizing musical notes and playing scales directly from a video image. The main contributions of the paper are the algorithms developed to find the 11 features of the notes. These features have been used as inputs for a feed-forward net. One problem with this method is the capture of the scale and its binary representation in computer memory: even when the original scale is ideal, that is, its stave lines are parallel, its representation in memory will not be. This work can be continued by considering other artificial neural structures that may be better suited to recognizing and playing more difficult scales with many more types of notes.
Chauvin, Y., Rumelhart, D. E. (1995). Backpropagation: Theory, Architecture, and Applications, Lawrence Erlbaum, ISBN 0-8058-1259-8, Hillsdale, New Jersey, U.S.A.
Hebb, D. O. (1961). Distinctive features of learning in the higher animal, Oxford University Press, London, England
Hopfield, J.J. (1982). Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Sciences of the U.S.A., vol. 79 no. 8, pp. 2554-2558, U.S.A., April 1982.
Kohonen, T. (1995). Self-Organizing Maps, Springer, Vol. 30, ISBN 3-540-67921-9, Berlin, Germany
Moise, A. (2005). Neural networks for pattern recognition, MatrixRom Ed, ISBN 973-685-904-5, Bucharest, Romania
Pavlov, I. P. (1927). Conditioned Reflexes: An Investigation of the Physiological Activity of the Cerebral Cortex, Translated and Edited by G. V. Anrep, Oxford University Press, London, England
Yadid-Pecht, O., Gerner, M., Dvir, L., Brutman, E., & Shimony, U. (1996). Recognition of handwritten musical notes by a modified Neocognitron. Machine Vision and Applications, vol. 9, no. 2, pp. 65-72, ISSN 0932-8092, Springer, Germany
Tab. 1. Identifying the length of a note

Le     L1   ...   L7   Pos   Full   Stem   Flag   t1   t2
full   1    0     0    1     0      0      0      1    1

Tab. 2. The network output after training

Note no.   Note value      Length
Note 1     4.99868 (sol)   0.119873 (eighth)
Note 2     1.99765 (re)    0.131964 (eighth)
Note 3     2.00323 (re)    0.229565 (quarter)
Note 4     4.98409 (sol)   0.253838 (quarter)
Note 5     2.99876 (mi)    0.122752 (eighth)
Note 6     2.99876 (mi)    0.122752 (eighth)
Note 7     2.99876 (mi)    0.494526 (half)
Author: Moise, Adrian; Constantin, Adrian; Bucur, Gabriela
Publication: Annals of DAAAM & Proceedings
Date: Jan 1, 2009