The ERMITES 2013 Summer School brings together leading international researchers and gives participants the opportunity to gain deeper insight into current research trends in scaled audiovisual information retrieval. It is organized as a series of long talks, during which students are invited to interact.
The target audience of the school is graduate and PhD students, post-doctoral researchers, and academic or industrial researchers.
Any participant may present their research in a poster or oral format; please contact the organizers if you wish to do so.
As the number of participants is limited (about 32), a confirmation notification will be sent (first-come, first-served policy).
IMPORTANT DATES:
- Deadline for registration: 9th of September evening.
INVITED TALKS: Samy Bengio, Senior Researcher, Google, USA
'Large Scale Image/Music Understanding'
Image annotation is the task of providing textual semantics for new images, by ranking a large set of possible annotations according to how well they correspond to a given image. In the large-scale setting, there can be millions of images to process and hundreds of thousands of potential distinct annotations. In order to achieve such a task, we propose to build a so-called 'embedding space', into which both images and annotations can be automatically projected. In such a space, one can then find the nearest annotations to a given image, or annotations similar to a given annotation. One can even build a visio-semantic tree from these annotations, which corresponds to how concepts (annotations) are similar to each other with respect to their visual characteristics. Such a tree differs from semantic-only trees, such as WordNet, which do not take into account the visual appearance of concepts. We propose a new learning-to-rank approach that can scale to such datasets and show some annotation results. The same idea can be applied to many other problems, including music recommendation, which I will describe briefly.
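As a rough illustration, the ranking step in such an embedding space can be sketched as follows. All dimensions and weights below are placeholders (in practice the two projection matrices are learned with a ranking loss, not drawn at random):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: image features of size d_img, a vocabulary of
# n_labels annotations, and a shared embedding space of size d_emb.
d_img, d_emb, n_labels = 128, 32, 1000

W_img = rng.normal(scale=0.1, size=(d_emb, d_img))    # image projection (learned)
W_lab = rng.normal(scale=0.1, size=(n_labels, d_emb)) # one embedding per label (learned)

def annotate(image_feat, k=5):
    """Rank all labels for one image by similarity in the embedding space."""
    z = W_img @ image_feat          # project the image into the joint space
    scores = W_lab @ z              # dot product with every label embedding
    return np.argsort(-scores)[:k]  # indices of the k best annotations

def similar_labels(label_id, k=5):
    """Labels nearest to a given label: the basis of the visio-semantic tree."""
    scores = W_lab @ W_lab[label_id]
    return np.argsort(-scores)[:k]

top = annotate(rng.normal(size=d_img))
```

Because images and labels live in the same space, both queries above reuse the same nearest-neighbour machinery, which is what makes the approach tractable at scale.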
Jorge Sanchez, A. Prof. Cordoba univ., Argentina
'Fisher Vectors for Large Scale Classification'
The Fisher Vector (FV) has been introduced in classification as an alternative to the popular Bag-of-Words (BOV) image representation. As in the BOV, images are characterized by summary statistics computed from a set of low-level patch descriptors extracted from the image. In the FV framework, the sample is characterized by its deviation with respect to a generative model of the data. Such a representation is given by a gradient vector w.r.t. the parameters of the model, which is chosen to be a Gaussian mixture with diagonal covariances. The FV has many advantages compared to the BOV. First, it gives a more complete representation of the samples, as it considers information that goes beyond simple counts. Second, by encoding additional information, it requires smaller vocabularies to achieve a given accuracy. This makes the FV very efficient to compute. Third, its classification performance ranks among the best in a wide range of problems, despite relying on simple linear classifiers.
We will first present a formal overview of the FV framework, showing some recent results on several small- to large-scale problems. Next, I'll discuss some extensions to the FV which show the generality and modeling power of the approach. Finally, I'll present some applications to other classification-related problems.
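For intuition, the core of the FV computation can be sketched as below. This is a simplified version covering only the gradient with respect to the GMM means; full FVs also include weight and variance gradients plus power and L2 normalization:

```python
import numpy as np

def fisher_vector_means(X, pi, mu, sigma2):
    """Simplified Fisher Vector: gradient w.r.t. the GMM means only.

    X:      (n, d) low-level patch descriptors from one image
    pi:     (k,)   mixture weights
    mu:     (k, d) component means
    sigma2: (k, d) diagonal variances
    """
    n, d = X.shape
    k = pi.shape[0]
    # Posterior (soft assignment) of each descriptor to each Gaussian.
    log_p = np.empty((n, k))
    for j in range(k):
        diff = X - mu[j]
        log_p[:, j] = (np.log(pi[j])
                       - 0.5 * np.sum(np.log(2 * np.pi * sigma2[j]))
                       - 0.5 * np.sum(diff ** 2 / sigma2[j], axis=1))
    log_p -= log_p.max(axis=1, keepdims=True)   # stabilize before exponentiating
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)   # (n, k) soft assignments
    # Normalized gradient w.r.t. each mean: one d-dim block per Gaussian.
    fv = np.concatenate([
        (gamma[:, j, None] * (X - mu[j]) / np.sqrt(sigma2[j])).sum(0)
        / (n * np.sqrt(pi[j]))
        for j in range(k)
    ])
    return fv   # length k * d, typically fed to a linear SVM
```

Note how the k*d output already exceeds the k-dimensional BOV histogram for the same vocabulary size, which is why much smaller vocabularies suffice.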
Adrien Gaidon, Research Scientist, Xerox Research Center Europe, France and
Cordelia Schmid, DR INRIA, LEAR, Grenoble, France
'Action recognition from videos: some recent results'
Part 1:
Automatic video understanding is a growing need for many applications in order to manage and exploit the enormous - and ever-increasing - volume of available video data. In particular, recognition of human activities is important, since videos are often about people doing something. Modelling and recognizing actions is as yet an unsolved issue. In this talk, we will present original methods that yield significant performance improvements by leveraging both the content and the spatio-temporal structure of videos.
First, we will describe some robust models capturing both the content of and the relations between action parts. Our approach consists in organizing collections of robust local features into structured action representations, for which we propose efficient kernels. Even if they share the same underlying principles, our methods differ in terms of the type of problem they address and the structural information they rely on.
Part 2:
Second, we will talk about some recent advances in video representation, namely trajectory-based video features, which have been shown to outperform the state of the art. These features are obtained by dense point sampling and tracking based on displacement information from a dense optical flow field. Trajectory descriptors are obtained with motion boundary histograms, which are robust to camera motion. Third, we will also show how to move towards more structured representations by explicitly modeling human-object interactions using the relative trajectory of an object with respect to a human. Finally, we will present work on learning object detectors from real-world web videos using a fully automatic pipeline that localizes objects in a set of videos and learns a detector for them. The approach extracts candidate spatio-temporal tubes based on motion segmentation and then selects one tube per video jointly over all videos. (Joint work with C. Schmid, Z. Harchaoui, V. Ferrari, H. Grabner, A. Klaeser, A. Prest, H. Wang.)
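The camera-motion robustness of motion boundary histograms comes from differentiating the flow field itself: a constant translation has zero spatial derivative. A minimal sketch (bin count and normalization are illustrative choices, not the authors' exact settings):

```python
import numpy as np

def mbh_descriptor(flow, n_bins=8):
    """Motion Boundary Histogram for one patch of dense optical flow.

    flow: (h, w, 2) array holding the horizontal and vertical flow
    components. Gradients of the flow (not of the image) are histogrammed,
    which cancels constant camera motion.
    """
    hist = []
    for c in range(2):                       # MBHx and MBHy channels
        gy, gx = np.gradient(flow[:, :, c])  # spatial derivatives of the flow
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx) % (2 * np.pi)
        bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
        h = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
        hist.append(h / (h.sum() + 1e-10))   # L1-normalize each channel
    return np.concatenate(hist)              # length 2 * n_bins
```

In the full pipeline this descriptor is computed in cells along each dense trajectory rather than on a single patch, but the cancellation property is the same.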
Barbara Caputo, Senior Researcher, IDIAP EPF Lausanne, Switzerland
'Learning to learn in computer vision & robotics: some success stories and challenges ahead'
The awareness that learning of categories and concepts from multi-modal data should be a never-ending, dynamic process has led to a growing interest in algorithms for leveraging priors over the last years. This interest has taken different forms in different communities: while the visual recognition and robotics communities have focused mostly on designing algorithms able to cope with large-scale concept learning from multi-modal data, machine learning research has been developing theoretical frameworks able (to some extent) to explain the experimental success of several of these methods. In this lecture I will give an overview of the several settings where learning to learn has been applied (from domain adaptation to transfer learning), review the current state of the art in these research threads, link these algorithms to machine learning theories and outline the open challenges ahead. I will also provide links to various online resources, from software to established benchmark databases.
Matthieu Cord, Prof. Paris 6 univ., LIP6, France
'Beyond Bag of Visual Word model for image representation'
I will focus on a few extensions of the classical Bag-of-(Visual)-Words (BoVW) model, a widely used approach to represent visual documents. BoVW relies on the quantization of local descriptors and their aggregation into a single feature vector. The underlying concepts, such as the visual codebook, coding and pooling, and the impact of the main parameters of the BoVW pipeline will be discussed, along with a few proposals about pooling.
Recently, unsupervised learning methods have emerged to jointly learn visual codebooks and codes. I will present approaches based on restricted Boltzmann machines (RBM) to achieve this joint optimization. To enhance feature coding, RBMs may be regularized with a sparsity constraint term. I will show experimental results of this code learning strategy embedded in the BoVW pipeline for image classification. Some extensions concerning hierarchical and bio-inspired approaches for image representation will also be discussed. In addition to classification, I will present some applications in content-based image retrieval, focusing on interactive-learning-based approaches.
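To fix ideas, the baseline pipeline these extensions start from (hard-assignment coding followed by pooling) can be sketched with placeholder dimensions and a random codebook standing in for a learned one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a codebook of k visual words and the local
# descriptors extracted from one image (both random here for illustration).
k, d = 64, 16
codebook = rng.normal(size=(k, d))
descriptors = rng.normal(size=(200, d))

# Coding: hard-assign each descriptor to its nearest visual word.
dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = np.zeros((len(descriptors), k))
codes[np.arange(len(descriptors)), dists.argmin(1)] = 1.0

# Pooling: aggregate the per-descriptor codes into one image-level vector.
bow_sum = codes.sum(0)   # classic BoVW histogram (sum pooling)
bow_max = codes.max(0)   # max pooling, common with sparse coding
```

The RBM-based approaches discussed in the talk replace this fixed quantization step with codes learned jointly with the codebook; the pooling stage stays structurally the same.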
Sebastien Paris, A. Prof. Aix Marseille univ., France
'Efficient Bag of Scenes Analysis for Image Categorization'
We address the general problem of image/object categorization with a novel approach referred to as Bag-of-Scenes (BoS). Our approach is efficient for low-semantic applications such as texture classification as well as for higher-semantic tasks such as natural scene recognition or fine-grained visual categorization. It is based on the widely used combination of (i) sparse coding (Sc), (ii) max-pooling and (iii) Spatial Pyramid Matching techniques applied to histograms of multi-scale Local Binary/Ternary Patterns (LBP/LTP) and their improved variants. This approach can be considered as a two-layer hierarchical architecture: the first layer encodes the local spatial patch structure via histograms of LBP/LTP, while the second encodes the relationships between pre-analyzed LBP/LTP scenes/objects. Our method outperforms SIFT-based approaches using Sc techniques and can be trained efficiently with a simple linear SVM.
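A minimal sketch of the first layer, the basic 8-neighbour LBP histogram of a grey-level patch (the multi-scale LTP variants and the Sc/max-pooling second layer are omitted):

```python
import numpy as np

def lbp_histogram(patch):
    """Basic 8-neighbour Local Binary Pattern histogram of a grey patch.

    Each interior pixel is encoded by comparing it to its 8 neighbours,
    giving a code in [0, 256); the normalized histogram of codes is the
    patch descriptor.
    """
    p = patch.astype(float)
    c = p[1:-1, 1:-1]                               # interior (centre) pixels
    neighbours = [p[:-2, :-2], p[:-2, 1:-1], p[:-2, 2:], p[1:-1, 2:],
                  p[2:, 2:], p[2:, 1:-1], p[2:, :-2], p[1:-1, :-2]]
    code = np.zeros_like(c, dtype=int)
    for bit, n in enumerate(neighbours):
        code += (n >= c).astype(int) << bit         # one bit per neighbour
    hist = np.bincount(code.ravel(), minlength=256)
    return hist / hist.sum()
```

These per-patch histograms are what the second layer then sparse-codes and max-pools over the spatial pyramid.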
Marc Le Goc, Prof. Aix Marseille univ., France
'Learning with the Theory of Timed Observations'
We present the recent Theory of Timed Observations (TTO, Le Goc 2006), based on a new mathematical object called the Timed Observation. It notably merges results from the theories of Bayesian networks and Markov chains, Poisson processes, Shannon's theory of communication and the logical theory of diagnosis.
With the extension of the informational entropy concept to the temporal dimension of the data, the TTO provides (i) the basis of a reasoning process to induce temporal knowledge from timed data, (ii) the organizational laws of the discovered knowledge in a 4-tuple model of the dynamic process that produces the timed data and (iii) an abstraction principle allowing multi-scale modeling. The theory is thus the first mathematical basis of a learning process that combines a Knowledge Engineering methodology called Tom4D (Timed Observations Modeling for Diagnosis) and a Knowledge Discovery in Databases process called Tom4L (Timed Observations Modeling for Learning).
The advantage of the TTO is to unify the representation formalisms of Tom4D and Tom4L: the human and the data knowledge sources are associated within a single learning process combining the advantages of human learning with those of the machine. The main contribution of the TTO is to model a data production process without any prior knowledge. We present the main concepts and properties of the TTO with didactic examples and real-world applications (continuous production processes, smart environments and the financial industry). Through an experimental and conceptual benchmark, we show that the Tom4L process provides better results than the best comparable learning algorithms. We conclude on the new problems that the TTO introduces, in particular for the validation of the induced knowledge models, and on future developments: multi-scale modeling of the brain.
Patrick Pirim, Brain Vision Systems, Paris
'Scaled bio-inspired perception system'
BVS has recently developed the BIPS (Bio-Inspired Perception System), a real-time perception processor based on the human visual system which integrates perception, understanding and command of a unit in one single application.
Since the first silicon implementation of this model in 1986, many applications have driven the technique's evolution over time. Collaborative work with academics such as Alain Berthoz, Yves Burnod, Jean-Arcady Meyer and many others has helped the integration in this direction. Today we can propose a generic perceptive model in a single silicon chip with different kinds of inputs such as vision, sound and touch. After a brief history of these 26 years of development, we will present the current chip version of the BIPS applied to the vision field.
The perception model presented is divided into three modalities: Global, Dynamic and Structural. For each modality, neural population activity represented by a spatio-temporal histogram computation gives the 'What and Where' situation. The adjunction of dynamic recruitment and lateral inhibition permits the self-organization of object descriptions.
For an application, it is necessary to perceive, understand and react in real time; the more you perceive, the easier it is to understand, so perception is mandatory. We will show examples such as an autonomous car driver and crossing-road control.
Looking at the future today, we present the next evolution of this model towards a generic cortical column with scalable integration. It becomes a generic perception computer, driven externally depending on the application; the perspective is huge.
Herve Glotin, Prof. Toulon univ. & Inst. Univ. de France
'Sparse and Scattering operators for bioacoustic classification'
After a brief introduction to machine learning for automatic speech recognition, we demonstrate the main difficulties in applying it to bioacoustics. We then discuss two efficient approaches for scaled classification of animal sounds: sparse coding and scattering operators. We illustrate their advantages with various species, from bats to whales.
A more detailed illustration is given for humpback whale songs, which present several similarities to speech, including voiced- and unvoiced-type vocalizations, and which have been analyzed with a great variety of methods. Most studies of these songs are based on the classification of sound units; however, detailed analysis of the vocalizations has shown that the features of a unit can change abruptly throughout its duration, making it difficult to characterize and cluster units systematically. We then show how joint sparse coding and scattering operators can help to determine the stable components of a song versus the evolving ones. This results in a separation of the song components, and then highlights song copying between males across years.
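As a toy illustration of the sparse-coding side (scattering operators require a wavelet cascade and are not sketched here), a greedy matching-pursuit coder over a hypothetical dictionary of song-unit atoms:

```python
import numpy as np

def matching_pursuit(x, D, n_atoms=5):
    """Greedy sparse code of a signal frame x over a dictionary D (atoms in rows).

    The frame is approximated by a few dictionary atoms; in the song-analysis
    setting, stable components keep reusing the same atoms across units while
    evolving components do not.
    """
    D = D / np.linalg.norm(D, axis=1, keepdims=True)  # unit-norm atoms
    residual = x.astype(float).copy()
    code = np.zeros(len(D))
    for _ in range(n_atoms):
        corr = D @ residual                 # correlation with every atom
        j = np.argmax(np.abs(corr))         # pick the best-matching atom
        code[j] += corr[j]                  # record its coefficient
        residual -= corr[j] * D[j]          # remove its contribution
    return code, residual
```

Matching pursuit is only one of several sparse coders (L1-regularized solvers are another common choice); it is used here because its greedy loop makes the sparsity mechanism explicit.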
We will illustrate the scaled bioacoustics paradigm with an overview of the workshop that we organized at ICML 2013: the Bioacoustic Classification Challenge.
This work is supported by IUF and
Scaled Acoustic Biodiversity SABIOD MASTODONS CNRS project.
Registration Fees (payment by CB or invoice to USTV)
You may choose between a 1-day and a 3-day pack, with a single or a shared room studio.
The 3-day pack includes 2 nights, 5 meals, 2 breakfasts, coffee breaks and the proceedings:
- D1 (PhD, post-doctorate and Master students only), shared double room studio: 300 euros,
- D2 (others: full position, company), shared double room studio: 450 euros,
- S1 (like D1 but single room): 330 euros,
- S2 (like D2 but single room): 480 euros.
The daily pack includes 1 meal, a coffee break and the proceedings, without sleeping accommodation:
- Daily student (PhD, post-doctorate, Master): 70 euros per day,
- Daily non-student: 100 euros per day.
You can pay either by invoice or by credit card at this address:
DO ONLINE REGISTRATION BY CB
Access: ERMITES 13 takes place at the IGESA center, in the middle of Porquerolles island, reachable from Hyeres TGV station then bus (67), or from Toulon International Airport, then boat (15 min). We may also organize car transfers from Hyeres to the boat - more details on trains/boats.
Social activities: a short walk from IGESA to the Cap Grand Langoustier will offer attendees a breath of fresh air in this paradise, and the opportunity to extend informal discussions.
Committees :
Organizing co. : J. Razik (pres), X. Halkias, H. Glotin, Y. Doh, C. Rabouy, M. Bartcus, O. Dufour.
Program co. : H. Glotin (Pres.), S. Bengio, X. Halkias, S. Paris, J. Razik, T. Artieres, L. Ralaivola, F. Chamroukhi.
Contact : ermites@gmail.com