
Tuesday, November 26, 2013

Released Data Set: Features Extracted From YouTube Videos for Multiview Learning


“If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.”

Performance of machine learning algorithms, supervised or unsupervised, is often significantly enhanced when a variety of feature families, or multiple views of the data, are available. For example, in the case of web pages, one feature family can be based on the words appearing on the page, and another can be based on the URLs and related connectivity properties. Similarly, videos contain both audio and visual signals, and each modality can in turn be analyzed in a variety of ways. For instance, the visual stream can be analyzed based on the color and edge distribution, texture, motion, object types, and so on. YouTube videos are also associated with textual information (title, tags, comments, etc.). Each feature family complements the others in providing signals for a prediction or classification task, for example, automatically classifying videos into subject areas such as sports, music, comedy, and games.

We have released a dataset of over 100,000 feature vectors extracted from public YouTube videos. These videos are labeled with one of 30 classes, each class corresponding to a video game (with some amount of class noise): each video shows gameplay of a video game, for example for instructional purposes. Each instance (video) is described by three feature families (textual, visual, and auditory), and each family is broken into subfamilies, yielding up to 13 feature types per instance. Neither video identities nor class identities are released.
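As a concrete illustration, here is a minimal, hypothetical sketch of two common ways to combine such feature families for classification: early fusion (concatenating the views) and late fusion (one classifier per view whose predicted probabilities are averaged). The function names and the use of scikit-learn are our own assumptions, not part of the released dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def late_fusion_predict(train_views, labels, test_views):
    """Train one classifier per view (e.g. textual, visual, auditory) and
    average their class probabilities; the label set is shared across views."""
    probas, classes = [], None
    for X_train, X_test in zip(train_views, test_views):
        clf = LogisticRegression(max_iter=1000).fit(X_train, labels)
        probas.append(clf.predict_proba(X_test))
        classes = clf.classes_          # same label set for every view
    return classes[np.mean(probas, axis=0).argmax(axis=1)]

# Early fusion is simply concatenation of the views before training a single model:
# X = np.hstack([text_features, visual_features, audio_features])
```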

We hope that this dataset will be valuable for research on a variety of multiview-related machine learning topics, including multiview clustering, co-training, active learning, and classifier fusion and ensembles.

The data and more information can be obtained from the UCI machine learning repository (multiview video dataset), or from here.

Wednesday, June 12, 2013

Improving Photo Search: A Step Across the Semantic Gap



Last month at Google I/O, we showed a major upgrade to the photos experience: you can now easily search your own photos without having to manually label each and every one of them. This is powered by computer vision and machine learning technology, which uses the visual content of an image to generate searchable tags for photos; combined with other sources like text tags and EXIF metadata, this enables search across thousands of concepts such as flower, food, car, jet ski, or turtle.

For many years Google has offered Image Search over web images; however, searching across photos represents a difficult new challenge. In Image Search there are many pieces of information which can be used for ranking images, for example text from the web or the image filename. However, in the case of photos, there is typically little or no information beyond the pixels in the images themselves. This makes it harder for a computer to identify and categorize what is in a photo. There are some things a computer can do well, like recognize rigid objects and handwritten digits. For other classes of objects, this is a daunting task, because the average toddler is better at understanding what is in a photo than the world’s most powerful computers running state of the art algorithms.

This past October the state of the art moved a bit closer to toddler performance. A system that used deep learning and convolutional neural networks easily beat out more traditional approaches in the ImageNet computer vision competition, which is designed to test image understanding. The winning team was from Professor Geoffrey Hinton’s group at the University of Toronto.

We built and trained models similar to those from the winning team using software infrastructure for training large-scale neural networks developed at Google in a group started by Jeff Dean and Andrew Ng. When we evaluated these models, we were impressed; on our test set we saw double the average precision when compared to other approaches we had tried. We knew we had found what we needed to make photo searching easier for people using Google. We acquired the rights to the technology and went full speed ahead adapting it to run at large scale on Google’s computers. We took cutting edge research straight out of an academic research lab and launched it, in just a little over six months. You can try it out at photos.google.com.

Why the success now? What is new? Some things are unchanged: we still use convolutional neural networks -- originally developed in the late 1990s by Professor Yann LeCun in the context of software for reading handwritten letters and digits. What is different is that both computers and algorithms have improved significantly. First, bigger and faster computers have made it feasible to train larger neural networks with much larger data. Ten years ago, running neural networks of this complexity would have been a momentous task even on a single image -- now we are able to run them on billions of images. Second, new training techniques have made it possible to train the large deep neural networks necessary for successful image recognition.
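For readers who have not seen one, a convolutional network stacks learned convolution and pooling layers in front of a final classifier. Here is a deliberately tiny sketch using Keras purely for illustration; the actual models were far larger, differently structured, and trained on Google's internal infrastructure.

```python
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 1100   # size of the launched label set mentioned later in the post (assumed here)

model = keras.Sequential([
    keras.Input(shape=(224, 224, 3)),          # an RGB image
    layers.Conv2D(32, 3, activation="relu"),   # learned convolution filters
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.GlobalAveragePooling2D(),
    layers.Dense(num_classes, activation="softmax"),   # effectively a linear classifier plus softmax
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```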

We feel it would be interesting to the research community to discuss some of the unique aspects of the system we built and some qualitative observations we had while testing the system.

The first is our label and training set and how it compares to the one used in the ImageNet Large Scale Visual Recognition competition. Since we were working on search across photos, we needed an appropriate label set. We came up with a set of about 2000 visual classes based on the most popular labels on Google+ Photos that also seemed to have a visual component, that is, classes a human could recognize visually. In contrast, the ImageNet competition has 1000 classes. As in ImageNet, the classes are not text strings but entities; in our case we use Freebase entities, which form the basis of the Knowledge Graph used in Google search. An entity is a way to uniquely identify something in a language-independent way. In English, when we encounter the word “jaguar”, it is hard to determine whether it refers to the animal or the car manufacturer. Entities assign a unique ID to each, removing that ambiguity, in this case “/m/0449p” for the former and “/m/012x34” for the latter. In order to train better classifiers we used more training images per class than ImageNet, 5000 versus 1000. Since we wanted to provide only high precision labels, we also refined the classes from our initial set of 2000 down to the 1100 most precise classes for our launch.

During our development process we made many more qualitative observations that we feel are worth mentioning:

1) Generalization performance. Even though there was a significant difference in visual appearance between the training and test sets, the network appeared to generalize quite well. To train the system, we used images mined from the web, which did not match the typical appearance of personal photos. Images on the web are often used to illustrate a single concept and are carefully composed, so an image of a flower might be only a close-up of a single flower. Personal photos, by contrast, are unstaged and impromptu: a photo of a flower might contain many other things and may not be very carefully composed. So our training set image distribution was not necessarily a good match for the distribution of images we wanted to run the system on, as the examples below illustrate. However, we found that our system trained on web images was able to generalize and perform well on photos.

A typical photo of a flower found on the web.
A typical photo of a flower found in an impromptu photo.

2) Handling of classes with multimodal appearance. The network seemed to handle classes with multimodal appearance quite well; for example, the “car” class contains both exterior and interior views of cars. This was surprising because the final layer is effectively a linear classifier, which creates a single dividing plane in a high-dimensional space. Since it is a single plane, this type of classifier is often not very good at representing multiple very different concepts.
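To make the surprise concrete, here is a small toy sketch (entirely our own, not part of the production system) showing how a single linear decision boundary fails when a class occupies two well-separated clusters in the raw feature space; a deep network can avoid this by learning a feature space in which such a class becomes much closer to linearly separable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Positive class (think "car": exteriors and interiors) falls into two distant
# clusters; the negative class sits between them.
pos = np.vstack([rng.normal([-4.0, 0.0], 0.5, size=(100, 2)),
                 rng.normal([+4.0, 0.0], 0.5, size=(100, 2))])
neg = rng.normal([0.0, 0.0], 0.5, size=(200, 2))
X = np.vstack([pos, neg])
y = np.array([1] * 200 + [0] * 200)

clf = LogisticRegression().fit(X, y)
print("accuracy of a single dividing plane:", clf.score(X, y))  # near chance (~0.5)
```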

3) Handling abstract and generic visual concepts. The system was able to do reasonably well on classes that one would think are somewhat abstract and generic. These include "dance", "kiss", and "meal", to name a few. This was interesting because for each of these classes it did not seem that there would be any simple visual clues in the image that would make it easy to recognize this class. It would be difficult to describe them in terms of simple basic visual features like color, texture, and shape.

Photos recognized as containing a meal.

4) Reasonable errors. Unlike other systems we experimented with, the errors which we observed often seemed quite reasonable to people. The mistakes were the type that a person might make - confusing things that look similar. Some people have already noticed this, for example, mistaking a goat for a dog or a millipede for a snake. This is in contrast to other systems which often make errors which seem nonsensical to people, like mistaking a tree for a dog.

Photo of a banana slug mistaken for a snake.
Photo of a donkey mistaken for a dog.

5) Handling very specific visual classes. Some of our classes are very specific, such as particular types of flowers, for example “hibiscus” or “dahlia”. We were surprised that the system could do well on those. Recognizing such subclasses often requires very fine detail to differentiate between them. So it was surprising that a system that could do well on a whole-image concept like “sunsets” could also do well on very specific classes.

Photo recognized as containing a hibiscus flower.
Photo recognized as containing a dahlia flower.
Photo recognized as containing a polar bear.
Photo recognized as containing a grizzly bear.

The resulting computer vision system worked well enough to launch to people as a useful tool to help improve personal photo search, which was a big step forward. So, is computer vision solved? Not by a long shot. Have we gotten computers to see the world as well as people do? The answer is not yet; there’s still a lot of work to do, but we’re closer.

Tuesday, August 28, 2012

Google at UAI 2012



The conference on Uncertainty in Artificial Intelligence (UAI) is one of the premier venues for research related to probabilistic models and reasoning under uncertainty. This year's conference (the 28th) set several new records: the largest number of submissions (304 papers, last year 285), the largest number of participants (216, last year 191), the largest number of tutorials (4, last year 3), and the largest number of workshops (4, last year 1). We interpret this as a sign that the conference is growing, perhaps as part of the larger trend of increasing interest in machine learning and data analysis.

There were many interesting presentations. A couple of my favorites included:
  • "Video In Sentences Out," by Andrei Barbu et al. This demonstrated an impressive system that is able to create grammatically correct sentences describing the objects and actions occurring in a variety of different videos. 
  • "Exploiting Compositionality to Explore a Large Space of Model Structures," by Roger Grosse et al. This paper (which won the Best Student Paper Award) proposed a way to view many different latent variable models for matrix decomposition - including PCA, ICA, NMF, Co-Clustering, etc. - as special cases of a general grammar. The paper then showed ways to automatically select the right kind of model for a dataset by performing greedy search over grammar productions, combined with Bayesian inference for model fitting.

A strong theme this year was causality. In fact, we had an invited talk on the topic by Judea Pearl, winner of the 2011 Turing Award, in addition to a one-day workshop. Although causality is sometimes regarded as something of an academic curiosity, its relevance to important practical problems (e.g., to medicine, advertising, social policy, etc.) is becoming more clear. There is still a large gap between theory and practice when it comes to making causal predictions, but it was pleasing to see that researchers in the UAI community are making steady progress on this problem.

There were two presentations at UAI by Googlers. The first, "Latent Structured Ranking," by Jason Weston and John Blitzer, described an extension to a ranking model called Wsabie, which was published at ICML in 2011 and is widely used within Google. The Wsabie model embeds a pair of items (say a query and a document) into a low-dimensional space and uses distance in that space as a measure of semantic similarity. The UAI paper extends this to the setting where there are multiple candidate documents in response to a given query. In such a context, we can get improved performance by leveraging similarities between documents in the set.
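As a rough illustration of the core scoring idea described above (not the paper's actual implementation; the dimensions, matrix names, and the use of Euclidean distance are our own assumptions), a Wsabie-style model scores candidates by how close they lie to the query in a shared low-dimensional embedding space:

```python
import numpy as np

d_query, d_doc, k = 1000, 500, 32               # input and embedding dimensions (assumed)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.01, size=(k, d_query))   # query embedding matrix (learned in practice)
V = rng.normal(scale=0.01, size=(k, d_doc))     # document embedding matrix (learned in practice)

def scores(q, docs):
    """Score candidate documents by (negative) distance to the query embedding."""
    q_emb = U @ q                 # shape (k,)
    d_emb = docs @ V.T            # shape (n_docs, k)
    return -np.linalg.norm(d_emb - q_emb, axis=1)

q = rng.random(d_query)
docs = rng.random((10, d_doc))
ranking = np.argsort(-scores(q, docs))    # highest-scoring candidates first
```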

The second paper by Googlers, "Hokusai - Sketching Streams in Real Time," was presented by Sergiy Matusevych, Alex Smola and Amr Ahmed. (Amr recently joined Google from Yahoo, and Alex is a visiting faculty member at Google.) This paper extends the Count-Min sketch method for storing approximate counts to the streaming context. This extension allows one to compute approximate counts of events (such as the number of visitors to a particular website) aggregated over different temporal extents. The method can also be extended to store approximate n-gram statistics in a very compact way.
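For readers unfamiliar with the underlying data structure, here is a minimal sketch of a basic Count-Min sketch; the temporal aggregation over different time extents that Hokusai adds is not shown, and the hashing scheme below is just one simple choice.

```python
import hashlib

class CountMinSketch:
    """Approximate counts of items using a small, fixed-size table."""

    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived from a salted digest.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # The minimum across rows limits the over-count caused by hash collisions.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

# Example: approximate visitor counts aggregated from a stream of events.
cms = CountMinSketch()
for site in ["example.com"] * 1000 + ["other.net"] * 5:
    cms.add(site)
print(cms.estimate("example.com"))   # >= 1000, and typically close to it
```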

In addition to these presentations, Google was involved in UAI in several other ways: I served as a program co-chair on the organizing committee, several of the referees and attendees work at Google, and Google provided some sponsorship for the conference.

Overall, this was a very successful conference, in an idyllic setting (Catalina Island, an hour off the coast of Los Angeles). We believe UAI and its techniques will grow in importance as various organizations -- including Google -- start combining structured, prior knowledge with raw, noisy unstructured data.

Wednesday, August 22, 2012

Machine Learning Book for Students and Researchers



Our machine learning book, The Foundations of Machine Learning, is now published! The book, with authors from both Google Research and academia, covers a large variety of fundamental machine learning topics in depth, including the theoretical basis of many learning algorithms and key aspects of their applications. The material originated in the machine learning graduate course "Foundations of Machine Learning", taught by Mehryar Mohri over the past seven years, and has benefited considerably from comments and suggestions from students and colleagues at Google.

The book can serve as a textbook for both graduate students and advanced undergraduate students, and as a reference manual for researchers in machine learning, statistics, and many other related areas. As a supplement, it includes introductory material on topics such as linear algebra and optimization and other useful conceptual tools, as well as a large number of exercises at the end of each chapter, with full solutions provided online.



Monday, August 6, 2012

Speech Recognition and Deep Learning



The New York Times recently published an article about Google’s large scale deep learning project, which learns to discover patterns in large datasets, including... cats on YouTube!

What’s the point of building a gigantic cat detector you might ask? When you combine large amounts of data, large-scale distributed computing and powerful machine learning algorithms, you can apply the technology to address a large variety of practical problems.

With the launch of the latest Android platform release, Jelly Bean, we’ve taken a significant step towards making that technology useful: when you speak to your Android phone, chances are, you are talking to a neural network trained to recognize your speech.

Using neural networks for speech recognition is nothing new: the first proofs of concept were developed in the late 1980s(1), and after what can only be described as a 20-year dry spell, evidence that the technology could scale to modern computing resources has recently begun to emerge(2). What changed? Access to larger and larger databases of speech, advances in computing power (including GPUs and fast distributed computing clusters such as the Google Compute Engine, unveiled at Google I/O this year), and a better understanding of how to scale the algorithms to make them effective learners.

The research, which reduces the error rate by over 20%, will be presented(3) at a conference this September, but true to our philosophy of integrated research, we’re delighted to bring the bleeding edge to our users first.

--

1 Phoneme recognition using time-delay neural networks, A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang. IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 3, pp. 328-339, March 1989.

2 Acoustic Modeling using Deep Belief Networks, A. Mohamed, G. Dahl and G. Hinton. Accepted for publication in IEEE Transactions on Audio, Speech and Language Processing.

3 Application Of Pretrained Deep Neural Networks To Large Vocabulary Speech Recognition, N. Jaitly, P. Nguyen, A. Senior and V. Vanhoucke, Accepted for publication in the Proceedings of Interspeech 2012.

Tuesday, July 31, 2012

Natural Language in Voice Search



On July 26 and 27, we held our eighth annual Computer Science Faculty Summit on our Mountain View Campus. During the event, we brought you a series of blog posts dedicated to sharing the Summit's talks, panels and sessions, and we continue with this glimpse into natural language in voice search. --Ed

At this year’s Faculty Summit, I had the opportunity to showcase the newest version of Google Voice Search. This version hints at how Google Search, in particular on mobile devices and by voice, will become increasingly capable of responding to natural language queries.

I first outlined the trajectory of Google Voice Search, which was initially released in 2007. Voice actions, launched in 2010 for Android devices, made it possible to control your device by speaking to it. For example, if you wanted to set your device alarm for 10:00 AM, you could say “set alarm for 10:00 AM. Label: meeting on voice actions.” To indicate the subject of the alarm, a meeting about voice actions, you would have to use the keyword “label”! Certainly not everyone would think to frame the requested action this way. What if you could speak to your device in a more natural way and have it understand you?

At last month’s Google I/O 2012, we announced a version of voice actions that supports much more natural commands. For instance, your device will now set an alarm if you say “my meeting is at 10:00 AM, remind me”. This makes even previously existing functionality, such as sending a text message or calling someone, more discoverable on the device -- that is, if you express a voice command in whatever way feels natural to you, whether it be “let David know I’ll be late via text” or “make sure I buy milk by 3 pm”, there is now a good chance that your device will respond as you anticipated.

I then discussed some of the possibly unexpected decisions we made when designing the system we now use for interpreting natural language queries or requests. For example, as you would expect from Google, our approach to interpreting natural language queries is data-driven and relies heavily on machine learning. In complex machine learning systems, however, it is often difficult to figure out the underlying cause of an error: after supplying them with training and test data, you merely obtain a set of metrics that hopefully give a reasonable indication of the system’s quality, but they fail to explain why a certain input led to a given, possibly wrong, output.

As a result, even understanding why some mistakes were made requires experts in the field and detailed analysis, rendering it nearly impossible to harness non-experts in analyzing and improving such systems. To avoid this, we aim to make every partial decision of the system as interpretable as possible. In many cases, any random speaker of English could look at its possibly erroneous behavior in response to some input and quickly identify the underlying issue - and in some cases even fix it!

We are especially interested in working with our academic colleagues on some of the many fascinating research and engineering challenges in building large-scale, yet interpretable natural language understanding systems and devising the machine learning algorithms this requires.

Thursday, February 9, 2012

Quantifying comedy on YouTube: why the number of o’s in your LOL matters



In a previous post, we talked about quantification of musical talent using machine learning on acoustic features for YouTube Music Slam. We wondered if we could do the same for funny videos, i.e. answer questions such as: is a video funny, how funny do viewers think it is, and why is it funny? We noticed a few audiovisual patterns across comedy videos on YouTube, such as shaky camera motion or audible laughter, which we can automatically detect. While content-based features worked well for music, identifying humor based on just such features is AI-Complete. Humor preference is subjective, perhaps even more so than musical taste.

 Fortunately, at YouTube, we have more to work with. We focused on videos uploaded in the comedy category. We captured the uploader’s belief in the funniness of their video via features based on title, description and tags. Viewers’ reactions, in the form of comments, further validate a video’s comedic value. To this end we computed more text features based on words associated with amusement in comments. These included (a) sounds associated with laughter such as hahaha, with culture-dependent variants such as hehehe, jajaja, kekeke, (b) web acronyms such as lol, lmao, rofl, (c) funny and synonyms of funny, and (d) emoticons such as :), ;-), xP. We then trained classifiers to identify funny videos and then tell us why they are funny by categorizing them into genres such as “funny pets”, “spoofs or parodies”, “standup”, “pranks”, and “funny commercials”.

Next we needed an algorithm to rank these funny videos by comedic potential, e.g. is “Charlie bit my finger” funnier than “David after dentist”? Raw viewcount on its own is insufficient as a ranking metric since it is biased by video age and exposure. We noticed that viewers emphasize their reaction to funny videos in several ways: e.g. capitalization (LOL), elongation (loooooool), repetition (lolololol), exclamation (lolllll!!!!!), and combinations thereof. If a user writes “loooooool” rather than “loool”, does it mean they were more amused? We designed features to quantify the degree of emphasis on words associated with amusement in viewer comments. We then trained a passive-aggressive ranking algorithm using human-annotated pairwise ground truth and a combination of text and audiovisual features. Similar to Music Slam, we used this ranker to populate candidates for human voting for our Comedy Slam.
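As a rough illustration of the emphasis features described above (the regular expression, feature names, and outputs are our own simplifications, not the production feature set), one might quantify elongation, capitalization, and exclamation around amusement words like this:

```python
import re

# Laughter sounds, web acronyms, and their elongated/repeated variants.
AMUSEMENT = re.compile(r"(l+o+l+|(ha){2,}|(he){2,}|(ja){2,}|(ke){2,}|lmao+|rofl+)", re.IGNORECASE)

def emphasis_features(comment):
    """Crude per-comment emphasis signals for words associated with amusement."""
    feats = {"matches": 0, "longest_token": 0, "capitalized": 0, "exclamations": comment.count("!")}
    for match in AMUSEMENT.finditer(comment):
        token = match.group(0)
        feats["matches"] += 1
        feats["longest_token"] = max(feats["longest_token"], len(token))  # "loooooool" > "lol"
        feats["capitalized"] += token.isupper()                           # "LOL" vs "lol"
    return feats

print(emphasis_features("LOOOOOOL that was hilarious!!!"))
# {'matches': 1, 'longest_token': 8, 'capitalized': 1, 'exclamations': 3}
```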

 So far, more than 75,000 people have cast more than 700,000 votes, making comedy our most popular slam category. Give it a try!

Further reading:
  1. “Opinion Mining and Sentiment Analysis,” by Bo Pang and Lillian Lee. 
  2. “A Great Catchy Name: Semi-Supervised Recognition of Sarcastic Sentences in Online Product Reviews,” by Oren Tsur, Dmitry Davidov, and Ari Rappoport. 
  3. “That’s What She Said: Double Entendre Identification,” by Chloe Kiddon and Yuriy Brun.

Wednesday, June 8, 2011

Instant Mix for Music Beta by Google



Music Beta by Google was announced at the Day One Keynote of Google I/O 2011. This service allows users to stream their music collections from the cloud to any supported device, including a web browser. It’s a first step in creating a platform that gives users a range of compelling music experiences. One key component of the product, Instant Mix, is a playlist generator developed by Google Research. Instant Mix uses machine hearing to extract attributes from audio which can be used to answer questions such as “Is there a Hammond B-3 organ?” (instrumentation / timbre), “Is it angry?” (mood), “Can I jog to it?” (tempo / meter) and so on. Machine learning algorithms relate these audio features to what we know about music on the web, such as the fact that Jimmy Smith is a jazz organist or that Arcade Fire and Wolf Parade are similar artists. From this we can predict similar tracks for a seed track and, with some additional sequencing logic, generate Instant Mix playlists from songs in a user’s locker.
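For intuition about the similarity step described above, here is a toy nearest-neighbour sketch over audio-attribute vectors; the attribute names and values are invented for illustration and are not Instant Mix's actual features or scores:

```python
import numpy as np

# Hypothetical per-track attributes, e.g. (energy, percussiveness, tempo), scaled to [0, 1].
tracks = {
    "Mysterious Ways": np.array([0.9, 0.8, 0.7]),
    "MLK":             np.array([0.2, 0.1, 0.3]),
    "Fame":            np.array([0.8, 0.7, 0.6]),
    "And I Love Her":  np.array([0.3, 0.2, 0.4]),
}

def nearest_tracks(seed, n=2):
    """Return the n tracks closest to the seed in audio-attribute space."""
    seed_vec = tracks[seed]
    dists = {t: np.linalg.norm(v - seed_vec) for t, v in tracks.items() if t != seed}
    return sorted(dists, key=dists.get)[:n]

print(nearest_tracks("Mysterious Ways"))   # ['Fame', 'And I Love Her']
print(nearest_tracks("MLK"))               # ['And I Love Her', 'Fame']
```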

Because we combine audio analysis with information about which artists and albums go well together, we can use both dimensions of similarity to compare songs. If you pick a mellow track from an album, we will make a mellower playlist than if you pick a high energy track from the same album. For example, here we compare short Instant Mixes made from two very different tracks by U2. The first Instant Mix comes from "Mysterious Ways," an upbeat, danceable track from Achtung Baby with electric guitar and heavy percussion.


  1. U2 "Mysterious Ways"
  2. David Bowie "Fame"
  3. Oingo Boingo "Gratitude"
  4. Infectious Grooves “Spreck”
  5. Red Hot Chili Peppers “Special Secret Song Inside”
Compare this to a short Instant Mix made from a much more laid-back U2 cut, "MLK" from the album The Unforgettable Fire. This track has delicate vocals on top of a sparse synthesizer background and no percussion.


  1. U2 "MLK"
  2. Jewel “Don’t”
  3. Antony and the Johnsons “What Can I Do?”
  4. The Beatles “And I Love Her”
  5. Van Morrison “Crazy Love”
As you can hear, the “Mysterious Ways” Instant Mix is funky, with strong percussion and high-energy vocals while the “MLK” mix carries on with that track's laid-back lullaby feeling.

Our approach also allows us to create mixes from music in the long tail. Are you the lead singer in an unknown Dylan cover band? Even if your group is new or otherwise unknown, Instant Mix can still use audio similarity to match your tracks to real Dylan tracks (provided, of course, that you sing like Bob and your band sounds like The Band).

Our goal with Instant Mix is to build awesome playlists from your music collection. We achieve this by using machine learning to blend a wide range of information sources, including features derived from the music audio itself. Though we’re still in beta, and still have a lot of work to do, we believe Instant Mix is a great tool for music discovery that stands out from the crowd. Give it a try!

Further reading by Google Researchers:
Machine Hearing: An Emerging Field
Richard F. Lyon.

Sound Ranking Using Auditory Sparse-Code Representations
Martin Rehn, Richard F. Lyon, Samy Bengio, Thomas C. Walters, Gal Chechik.

Large-Scale Music Annotation and Retrieval: Learning to Rank in Joint Semantic Spaces
Jason Weston, Samy Bengio, Philippe Hamel.