Semi-automatic image annotation

L. Wenyin, S. Dumais, Y. Sun, H. Zhang, M. Czerwinski, and B. Field. Semi-automatic image annotation. In INTERACT 2001, 8th IFIP TC.13 Conference on Human-Computer Interaction, pages 326–333, 2001. [PDF]

——

This article describes a technique for incorporating users’ feedback as annotations in an image retrieval system. The authors’ basic argument is that manual annotation of images can be tedious for users. Even the direct annotation techniques proposed by Shneiderman do not solve the issue. On the other hand, fully automatic annotation of images is not yet feasible.

Therefore they propose a middle-ground approach to the problem, asking users of an information retrieval engine to rate the results returned by the system. When results are marked positively, the system incorporates the query terms as descriptors of the selected images.

To evaluate this architecture, the authors used two metrics: retrieval accuracy and annotation coverage. Annotation coverage is the percentage of annotated images in the database. Retrieval accuracy is how often the labeled items are correct. In their experiment, retrieval accuracy is the same as annotation coverage, because positive examples are automatically annotated with the query keywords.
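A minimal sketch of how these two metrics could be computed; the data structures and function names here are illustrative, not taken from the paper:

```python
def annotation_coverage(annotations: dict[str, set[str]]) -> float:
    """Fraction of images in the database carrying at least one annotation."""
    annotated = sum(1 for keywords in annotations.values() if keywords)
    return annotated / len(annotations)

def retrieval_accuracy(labels: dict[str, set[str]],
                       ground_truth: dict[str, set[str]]) -> float:
    """Fraction of propagated labels matching the ground-truth keywords."""
    correct = total = 0
    for image, keywords in labels.items():
        for kw in keywords:
            total += 1
            if kw in ground_truth.get(image, set()):
                correct += 1
    return correct / total if total else 0.0

# Toy database: 2 of 3 images annotated, 1 of 2 labels correct.
db = {"img1": {"beach"}, "img2": set(), "img3": {"dog"}}
truth = {"img1": {"beach", "sea"}, "img3": {"cat"}}
print(annotation_coverage(db))
print(retrieval_accuracy(db, truth))
```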

The authors found that the annotation strategy is more efficient when some initial manual annotations are available. Additionally, they performed a usability test of the system (called MiAlbum) and found that getting people to discover and use relevance feedback was difficult. To improve the discoverability of feedback, the authors argue that we need to improve participants’ understanding of the matching process.


Does Organisation by Similarity Assist Image Browsing?

K. Rodden, W. Basalaj, D. Sinclair, and K. Wood. Does organisation by similarity assist image browsing? In Proceedings of CHI 2001, Seattle, Wa, USA, March 31-April 4 2001. Association for Computing Machinery. [PDF] [link to author’s site]

——-

The title of this paper nicely summarizes the authors’ research question. They were interested in understanding whether organizing pictures by similarity might be beneficial for a picture retrieval task. Defining the similarity of two pictures is itself an interesting problem, and many researchers have tried to provide solutions based on image features or on the multi-modal information that might be associated with them. The research reported in this paper was concerned with understanding whether one organization might be more effective than another in supporting the retrieval process.

The authors used two kinds of organization: 1) similarity of visual features; 2) similarity of text annotations. For the retrieval experiments they used information retrieval’s vector model, with binary term weighting and the cosine coefficient measure. They also used a simulated work task situation, in which they asked graphic designers to look for sets of pictures to complement articles for a magazine.
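With binary term weighting, the cosine coefficient reduces to the size of the overlap between the two term sets divided by the geometric mean of their sizes. A small sketch of this measure (illustrative, not the authors’ code):

```python
import math

def cosine_binary(doc_a: set[str], doc_b: set[str]) -> float:
    """Cosine coefficient between two binary term vectors:
    |A ∩ B| / sqrt(|A| * |B|)."""
    if not doc_a or not doc_b:
        return 0.0
    return len(doc_a & doc_b) / math.sqrt(len(doc_a) * len(doc_b))

# Two annotation sets sharing the terms "beach" and "sea".
a = {"sunset", "beach", "sea"}
b = {"beach", "sea", "palm", "sand"}
print(cosine_binary(a, b))  # 2 / sqrt(3 * 4) ≈ 0.577
```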

They conducted two experiments. In the first, they tried to understand whether text-based organization was more useful than visual-based organization or a combination of the two. The majority of participants favoured the textual arrangement of pictures. In the second experiment, they compared, more quantitatively, a similarity arrangement to a random arrangement of pictures. They considered the time required to complete the task as the main dependent variable and analyzed the results with a linear regression model. Participants were slower with the visual arrangement than with the random selection of pictures. In their analysis, the authors suggested that the visual arrangement made it easy to find the target pictures; however, placing similar pictures together sometimes caused them to appear to merge, and therefore made them more difficult to parse.


How do people manage their digital photographs?

K. Rodden and K. R. Wood. How do people manage their digital photographs? In Proceedings of CHI 2003, Fort Lauderdale, Florida, USA, April 5-10 2003. [PDF] [PDF2]

———–

This paper describes a longitudinal study of how people manage their collections of digital photographs. The authors asked 13 subjects to use a prototype application, named Shoebox, for 6 months to catalogue their pictures. The software’s main feature was enabling text and voice tagging of the pictures. They found that after 6 months these features were not used, because participants relied efficiently on their memory and on the temporal sequence of the pictures to retrieve them.

The paper also contains an interesting argument that text-based queries can still be reasonably effective even if spoken material is inaccurately transcribed (Brown et al., 1996).

The paper contains an interesting comparison between digital photos and printed photos. Interestingly, the study was conducted in 2003, when digital pictures were still relatively new. The paper reports qualitative findings on how people used digital collections. Participants adopted Shoebox’s archival feature, which organized pictures into Rolls by their timestamps. However, participants did not increase the number of annotations they made on individual pictures. They felt this feature was uninteresting because it did not help them increase their retrieval efficiency.

The availability of text-based indexing and retrieval did not give participants extra motivation to invest effort in annotating their pictures.

Speech-based annotation and retrieval of digital photographs

T. J. Hazen, B. Sherry, and M. Adler. Speech-based annotation and retrieval of digital photographs. In Proceedings of INTERSPEECH 2007, the 8th Annual Conference of the International Speech Communication Association, pages 2165–2168, Antwerp, Belgium, August 27-31 2007. [PDF]

———

This paper presents an application supporting picture retrieval on mobile phones using voice annotations. The authors’ basic assumption is that speech is more efficient than text for operating a mobile device, and more efficient in general for conveying complex properties.

The core of the proposed application is a mixed-grammar recognition approach, which allows the speech recognition system to construct a single finite-state network combining context-free grammars.

The paper presents an evaluation of the application combining a field deployment with a lab study, where participants were asked to retrieve a set of pictures captured by themselves or by other participants. Retrieval was measured as the number of successful attempts to retrieve a picture with the first query and within 5 queries.

Results indicated that users’ knowledge of the subject matter of the photographs did not play a role in the retrieval process.

Expressive richness: a comparison of speech and text as media for revision

B. L. Chalfonte, R. S. Fish, and R. E. Kraut. Expressive richness: a comparison of speech and text as media for revision. In CHI ’91: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 21–26, New York, NY, USA, 1991. ACM. [PDF]

——–

This paper presents an experimental comparison of two modalities for sharing annotations on a shared document. The authors designed the experiment under the assumption, drawn from previous research, that richer, more informal, and more interactive media should be better suited for handling collaborative tasks.

The authors designed an experiment comparing subjects reviewing a paper using text and speech annotations. They found that participants were more likely to make local annotations in the written modality (spelling and grammar changes) and more likely to make global annotations in the speech modality (structure, missing a fundamental point, project status, etc.).

They defined a nice metric, the index of self-correction: the number of times an annotation was corrected before the final submission. They found that speech annotations were more likely to be corrected than written annotations. They also found that speech allowed for non-verbal communication that made communication richer.

From the analysis it appeared that speech was superior because it was more expressive and because it placed fewer cognitive demands on the communicator. However, this study did not focus on revealing the advantages or trade-offs of annotations in different modalities.

A comparison of speech versus typed input

A. G. Hauptmann and A. I. Rudnicky. A comparison of speech versus typed input. In Proceedings of the Third DARPA Speech and Natural Language Workshop, pages 219–224, San Mateo, 1990. Morgan Kaufmann. [pdf]

——-

This paper describes a controlled experiment in which the authors compared two input modalities: text and speech. They designed a task where subjects had to input a number of numeric strings into the computer, using either their voice, the keyboard, or a combination of the two. The authors used a number of custom-made metrics, such as the transaction error rate (the number of transactions that were strictly necessary divided by the total number of transactions performed) and the aggregate cycle time (the total time a subject needed to enter a number correctly).

The utterance accuracy results showed that speech requires many more interactions to complete the task than typing. According to the metrics defined, typing produced better results than speech. The authors discussed several reasons why this was the case.

The authors showed how speech compares with typing for digit-string entry tasks. However, they cautioned that real-world tasks, requiring more keystrokes per syllable, would demonstrate the effectiveness of speech much better.

The authors concluded that, depending on the task, speech can have tremendous advantages for casual users. The more a task requires visual monitoring of input, the more preferable speech becomes as an input medium. For skilled typists this relation might be reversed.

PhD thesis presentation: Annotations of Maps in Collaborative Work at a Distance

Yesterday, I presented my thesis publicly to the Faculty of Informatics and Communication at EPFL. I also delivered the final version of the thesis to the registrar’s office of the university. This final version has an extra page at the end with some minor corrections (thanks to Darren Gergle for the suggestions). The PDF of the thesis can be downloaded here (371 pages, 42.7 Mb), while the slide deck I used for the presentation can be downloaded here (57 slides with notes, 5 Mb).

As part of the presentation of the work, I prepared a little animation that shows a typical interaction a user might have with STAMPS. You can download this movie here (divx, 700 Kb).

Title: “Annotations of Maps in Collaborative Work at a Distance”

(thesis director: Pierre Dillenbourg)

Abstract:

This thesis inquires into how map annotations can be used to sustain remote collaboration. When we are face-to-face, we can point to things around us. At a distance, however, we need to recreate a context that helps disambiguate what we mean. A map can help recreate this context, but additional technological solutions are required to allow deictic gestures over a shared map when collaborators are not co-located. This mechanism is here termed Explicit Referencing.

Two field experiments were conducted to investigate the production of collaborative annotations of maps with mobile devices. Both studies led to very disappointing results. The reasons for this failure are attributed to the lack of a critical mass of users (social network), the lack of useful content, and limited social awareness. More importantly, the studies identified a compelling effect of the way messages were organized in the tested application, which caused participants to refrain from engaging in content-driven explorations and synchronous discussions.

This last qualitative observation was refined in a controlled experiment where remote participants had to solve a problem collaboratively, using chat tools that differed in the way a user could relate an utterance to a shared map.

Results indicated that team performance is improved by the Explicit Referencing mechanisms. However, when this is implemented in a way that is detrimental to the linearity of the conversation, resulting in the visual dispersion or scattering of messages, its use has negative consequences for collaborative work at a distance. Additionally, a primary relation was found between the pair’s recurrence of eye movements and their task performance.

Finally, this thesis presents an algorithm that detects misunderstandings in collaborative work at a distance. It analyses the movements of collaborators’ eyes over the shared map, their utterances containing references to this workspace, and the availability of ‘remote’ deictic gestures. The algorithm associates the distance between the gaze of the emitter and the gaze of the receiver of a message with the probability that the recipient did not understand the message.
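The core association could be sketched as follows. The gaze-distance idea comes from the description above; the exponential mapping and the `scale` parameter are purely illustrative assumptions, not the thesis’s actual model:

```python
import math

def gaze_distance(emitter_gaze: tuple[float, float],
                  receiver_gaze: tuple[float, float]) -> float:
    """Euclidean distance between the two gaze points on the shared map."""
    dx = emitter_gaze[0] - receiver_gaze[0]
    dy = emitter_gaze[1] - receiver_gaze[1]
    return math.hypot(dx, dy)

def misunderstanding_score(distance: float, scale: float = 200.0) -> float:
    """Map gaze distance to a score in [0, 1): the farther apart the
    emitter's and receiver's gazes, the higher the score.
    The mapping and scale are assumptions for illustration only."""
    return 1.0 - math.exp(-distance / scale)

# Gaze points in (invented) pixel coordinates on the shared map.
d = gaze_distance((120.0, 80.0), (450.0, 300.0))
print(round(misunderstanding_score(d), 2))
```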

Social functions of location in mobile telephony

I. Arminen. Social functions of location in mobile telephony. Personal Ubiquitous Computing, 10(5):319–323, 2006. [PDF]

——–

This paper describes a conversation analysis of mobile phone conversations. The author tried to understand why and how people communicate their location in phone calls. The study extends the work of Laurier and Weilenmann by elaborating on the way location features in mobile users’ communicative behavior.

The author finds five situations in which location is used in mobile conversations. The first is interactional availability (registered 15% of the time in the author’s dataset), i.e., the availability to discuss the content of the phone call. The second is when the communication of location has importance for an ongoing activity (22% of the cases), as when a car driver has to ask directions from a remote speaker. Similarly, location might assume importance during the call as part of the activity the parties are involved in (9%), or as a prompt for future activities (48%, the majority of situations registered). Finally, location can be communicated even if it has no relevance for the activity at hand; in this case, the author says it has social relevance (6% of the cases).

One of the main implications of this study, according to the author, is that location is never considered in purely geographical terms. Location is made important by the activities in which the parties are involved. In particular, joint activities create spatio-temporal patterns.


Formulating availability and location in mobile phone conversations

A. Weilenmann. “I can’t talk now, I’m in a fitting room”: formulating availability and location in mobile phone conversations. Environment and Planning A, 35(9):1589–1605, 2003. [pdf]

————-

This paper describes an analysis of recorded phone conversations. The author tried to understand the interrelations of location, activity, and availability. Indeed, the author showed how, in mobile phone conversations, people exchange spatial information to infer the other party’s availability to talk. Place can be inferred from activity and vice versa, as people might have a good understanding of their peers’ whereabouts.

In this paper I investigate the ways in which participants in mobile-phone conversations orient to each other’s location, activities, and availability. By looking at data from recorded mobile-phone conversations, I use a conversation analytic approach to make initial observations on the character of mobile-phone conversations. I found that the frequent question “what are you doing?” sometimes caused a location to be given as part of the answer which shows how location, activity, and availability are strongly related. The participants thus obtained information about location, when this was considered relevant, through asking about activity. Location seemed especially relevant if it provided information about a future meeting. In some of the conversations where it seemed there was something going on where the ‘called party’ was located, the ‘caller’ reacted by initiating the conversation with a strategy which gave the called party a chance to end the conversation.