Lecture 16: Filtering & TDT
2006.03.09 - SLIDE 1
IS 240 – Spring 2006
Prof. Ray Larson
University of California, Berkeley
School of Information Management & Systems
Tuesday and Thursday 10:30 am - 12:00 pm
Spring 2006
http://www.sims.berkeley.edu/academics/courses/is240/s06/
Principles of Information Retrieval
Lecture 16: Filtering & TDT
Overview
• Review
  – LSI
• Filtering & Routing
• TDT – Topic Detection and Tracking
How LSI Works
• Start with a matrix of terms by documents
• Analyze the matrix using SVD to derive a particular “latent semantic structure model”
• Two-mode factor analysis, unlike conventional factor analysis, permits an arbitrary rectangular matrix with different entities on the rows and columns
  – Such as terms and documents
How LSI Works
• The rectangular matrix is decomposed into three other matrices of a special form by SVD
  – The resulting matrices contain “singular vectors” and “singular values”
  – The matrices show a breakdown of the original relationships into linearly independent components or factors
  – Many of these components are very small and can be ignored, leading to an approximate model that contains many fewer dimensions
How LSI Works
Titles
C1: Human machine interface for LAB ABC computer applications
C2: A survey of user opinion of computer system response time
C3: The EPS user interface management system
C4: System and human system engineering testing of EPS
C5: Relation of user-perceived response time to error measurement
M1: The generation of random, binary, unordered trees
M2: The intersection graph of paths in trees
M3: Graph minors IV: Widths of trees and well-quasi-ordering
M4: Graph minors: A survey
Italicized words occur in multiple docs and are indexed
How LSI Works
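Term-by-document matrix (rows are terms, columns are documents; Survey also occurs in m4, "Graph minors: A survey"):

| Term | c1 | c2 | c3 | c4 | c5 | m1 | m2 | m3 | m4 |
|---|---|---|---|---|---|---|---|---|---|
| Human | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Interface | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Computer | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| User | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| System | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 |
| Response | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Time | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| EPS | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Survey | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Trees | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| Graph | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| Minors | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |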
How LSI Works
[Figure: SVD to 2 dimensions. Terms and documents plotted against Dimension 1 and Dimension 2]
Term numbers: 1 human, 2 interface, 3 computer, 4 user, 5 system, 6 response, 7 time, 8 EPS, 9 survey, 10 tree, 11 graph, 12 minor
Documents (with the numbers of the terms they contain): C1(1,2,3), C2(3,4,5,6,7,9), C3(2,4,5,8), C4(1,5,8), C5(4,6,7), M1(10), M2(10,11), M3(10,11,12), M4(9,11,12)
Query: Q(1,3)
• Blue dots are terms
• Documents are red squares
• Blue square is a query
• Dotted cone is cosine .9 from the query “Human Computer Interaction”; even docs with no terms in common with the query (c3 and c5) lie within the cone.
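
The two-dimensional example can be approximated numerically. The following is a minimal sketch, assuming NumPy is available. The names `TERMS`, `DOCS`, and `lsi_cosines` are illustrative, and scaling the document coordinates by the singular values is one common convention rather than something stated on the slides. The Survey row counts m4, whose title contains "survey".

```python
import numpy as np

TERMS = ["human", "interface", "computer", "user", "system", "response",
         "time", "eps", "survey", "trees", "graph", "minors"]
DOCS = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]

# Term-by-document counts for the nine example titles.
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # eps
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)


def lsi_cosines(X, query_terms, k=2):
    """Cosine between a query and every document in a k-dimensional LSI space."""
    # Thin SVD: X = T0 @ diag(s0) @ D0t
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    # Keep only the k largest singular values and their vectors.
    T_k, s_k, D_k = T0[:, :k], s0[:k], D0t[:k, :].T
    doc_coords = D_k * s_k  # one row per document
    # Treat the query as a pseudo-document and project it the same way.
    q = np.array([1.0 if term in query_terms else 0.0 for term in TERMS])
    q_coords = q @ T_k
    sims = doc_coords @ q_coords / (
        np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q_coords))
    return dict(zip(DOCS, sims.tolist()))


# "Interaction" is not an indexed term, so the query reduces to terms 1 and 3.
cosines = lsi_cosines(X, {"human", "computer"})
for doc in sorted(cosines, key=cosines.get, reverse=True):
    print(f"{doc}: {cosines[doc]:.2f}")
```

Under these assumptions the c documents should score well above the m documents, including c3 and c5, which share no indexed term with the query.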
How LSI Works
$$X = T_0 S_0 D_0'$$
Dimensions: $X$ is $t \times d$ (terms by docs), $T_0$ is $t \times m$, $S_0$ is $m \times m$, $D_0'$ is $m \times d$
$T_0$ has orthogonal, unit-length columns ($T_0' T_0 = I$)
$D_0$ has orthogonal, unit-length columns ($D_0' D_0 = I$)
$S_0$ is the diagonal matrix of singular values
$t$ is the number of rows in $X$
$d$ is the number of columns in $X$
$m$ is the rank of $X$ ($m \le \min(t, d)$)
Overview
• Review
  – LSI
• Filtering & Routing
• TDT – Topic Detection and Tracking
Filtering
• Characteristics of Filtering systems:
  – Designed for unstructured or semi-structured data
  – Deal primarily with text information
  – Deal with large amounts of data
  – Involve streams of incoming data
  – Filtering is based on descriptions of individual or group preferences (profiles). May be negative profiles (e.g. junk mail filters)
  – Filtering implies removing non-relevant material as opposed to selecting relevant.
Filtering
• Similar to IR, with some key differences
• Similar to Routing: sending relevant incoming data to different individuals or groups is virtually identical to filtering with multiple profiles
• Similar to Categorization systems: attaching one or more predefined categories to incoming data objects is also similar, but is more concerned with static categories (might be considered information extraction)
Structure of an IR System
[Diagram: Information Storage and Retrieval System. Adapted from Soergel, p. 19]
• Search line: Interest profiles & queries -> Formulating query in terms of descriptors -> Storage of profiles -> Store 1: Profiles/Search requests
• Storage line: Documents & data -> Indexing (descriptive and subject) -> Storage of documents -> Store 2: Document representations
• Rules of the game = Rules for subject indexing + Thesaurus (which consists of lead-in vocabulary and indexing language)
• Comparison/Matching of Store 1 against Store 2 -> Potentially relevant documents
Structure of a Filtering System
[Diagram: Information Filtering System. Adapted from Soergel, p. 19]
• Individual or group users -> Interest profiles -> Formulating query in terms of descriptors -> Storage of profiles -> Store 1: Profiles/Search requests
• Incoming data stream -> Raw documents & data -> Indexing/Categorization/Extraction -> Doc surrogate stream
• Rules of the game = Rules for subject indexing + Thesaurus (which consists of lead-in vocabulary and indexing language)
• Comparison/filtering of the doc surrogate stream against Store 1 -> Potentially relevant documents
Major differences between IR and Filtering
• IR concerned with single uses of the system
• IR recognizes inherent faults of queries
  – Filtering assumes profiles can be better than IR queries
• IR concerned with collection and organization of texts
  – Filtering is concerned with distribution of texts
• IR is concerned with selection from a static database
  – Filtering concerned with dynamic data stream
• IR is concerned with single interaction sessions
  – Filtering concerned with long-term changes
Contextual Differences
• In filtering the timeliness of the text is often of greatest significance
• Filtering often has a less well-defined user community
• Filtering often has privacy implications (how complete are user profiles? what do they contain?)
• Filtering profiles can (should?) adapt to user feedback
  – Conceptually similar to relevance feedback
Methods for Filtering
• Adapted from IR
  – E.g. use a retrieval ranking algorithm against incoming documents.
• Collaborative filtering
  – Individual and comparative profiles
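
The first method, running a retrieval-style similarity measure against each incoming document, might look like the sketch below. It assumes a bag-of-words profile, cosine similarity, and a fixed delivery threshold; the names `vectorize`, `cosine`, and `filter_stream` and the threshold value are hypothetical choices for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (lowercased, whitespace tokenized)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def filter_stream(profile_text, stream, threshold=0.3):
    """Yield each incoming document whose similarity to the profile meets the threshold."""
    profile = vectorize(profile_text)
    for doc in stream:
        # A yes/no decision is made per document as it arrives; nothing is ranked.
        if cosine(profile, vectorize(doc)) >= threshold:
            yield doc


incoming = ["graph minors survey", "stock market falls", "trees and graph paths"]
delivered = list(filter_stream("graph trees minors", incoming))
print(delivered)
```

Because the documents arrive as a stream, the similarity score is only compared with the threshold instead of being used to order a result list.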
TREC Filtering Track
• Original Filtering Track
  – Participants are given a starting query
  – They build a profile using the query and the training data
  – The test involves submitting the profile (which is not changed) and then running it against a new data stream
• New Adaptive Filtering Track
  – Same, except the profile can be modified as each new relevant document is encountered.
• Since streams are being processed, there is no ranking of documents
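
The difference between the two tracks can be sketched as follows. This is an illustration only, not the TREC protocol: it assumes a bag-of-words profile, a Rocchio-style additive update with a hypothetical weight `beta`, and a caller-supplied `is_relevant` judgment that is consulted only for delivered documents.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def adaptive_filter(query, stream, is_relevant, threshold=0.2, beta=1.0):
    """Deliver documents above the threshold, adding relevant ones to the profile."""
    profile = Counter(query.lower().split())
    delivered = []
    for doc in stream:
        vec = Counter(doc.lower().split())
        if cosine(profile, vec) >= threshold:
            delivered.append(doc)
            # Feedback is assumed to be available only for delivered documents.
            if is_relevant(doc):
                for term, count in vec.items():
                    profile[term] += beta * count
    return delivered, profile


stream = ["graph minors", "minors of trees", "stock prices"]
adaptive, _ = adaptive_filter("graph", stream, lambda d: "minors" in d)
static, _ = adaptive_filter("graph", stream, lambda d: False)
print(adaptive)
print(static)
```

With feedback, the profile picks up "minors" from the first delivered document and so also delivers the second one; with no feedback the profile never changes, which corresponds to the original track.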
TREC-8 Filtering Track
• Following Slides from the TREC-8 Overview by Ellen Voorhees
• http://trec.nist.gov/presentations/TREC8/overview/index.htm
Overview
• Review
  – LSI
• Filtering & Routing
• TDT – Topic Detection and Tracking
TDT: Topic Detection and Tracking
• Intended to automatically identify new topics – events, etc. – from a stream of text and follow the development/further discussion of those topics
Topic Detection and Tracking
Introduction and Overview
– The TDT3 R&D Challenge
– TDT3 Evaluation Methodology
Slides from “Overview NIST Topic Detection and Tracking - Introduction and Overview” by G. Doddington - http://www.itl.nist.gov/iaui/894.01/tests/tdt/tdt99/presentations/index.htm
TDT Task Overview*
• 5 R&D Challenges:
  – Story Segmentation
  – Topic Tracking
  – Topic Detection
  – First-Story Detection
  – Link Detection
• TDT3 Corpus Characteristics:†
  – Two Types of Sources:
    • Text
    • Speech
  – Two Languages:
    • English: 30,000 stories
    • Mandarin: 10,000 stories
  – 11 Different Sources:
    • 8 English: ABC, CNN, PRI, VOA, NBC, MNB, APW, NYT
    • 3 Mandarin: VOA, XIN, ZBN
* see http://www.itl.nist.gov/iaui/894.01/tdt3/tdt3.htm for details
† see http://morph.ldc.upenn.edu/Projects/TDT3/ for details
Preliminaries
A topic is …
a seminal event or activity, along with all directly related events and activities.
A story is …
a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single event.
Example Topic
Title: Mountain Hikers Lost
– WHAT: 35 or 40 young mountain hikers were lost in an avalanche in France around the 20th of January.
– WHERE: Orres, France
– WHEN: January 1998
– RULES OF INTERPRETATION: 5. Accidents
The Segmentation Task:
To segment the source stream into its constituent stories, for all audio sources (for radio and TV only).
[Figure: a transcription stream (text words) divided into story and non-story segments]
Story Segmentation Conditions
• 1 Language Condition: Both English and Mandarin
• 3 Audio Source Conditions:
  – manual transcription
  – ASR transcription
  – original audio data
• 3 Decision Deferral Conditions (maximum decision deferral period):

| Text: English (words) | Text: Mandarin (characters) | Audio: English & Mandarin (seconds) |
|---|---|---|
| 100 | 150 | 30 |
| 1,000 | 1,500 | 300 |
| 10,000 | 15,000 | 3,000 |
The Topic Tracking Task:
To detect stories that discuss the target topic, in multiple source streams.
• Find all the stories that discuss a given target topic
  – Training: Given Nt sample stories that discuss a given target topic,
  – Test: Find all subsequent stories that discuss the target topic.
[Figure: story timeline split into training data and test data, with stories labeled on-topic or unknown. New this year: unknown stories are not guaranteed to be off-topic.]
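
A minimal tracking sketch under simple assumptions: the Nt training stories are summed into one bag-of-words centroid, and each later story gets a yes/no decision by comparing its cosine similarity to a fixed threshold. The function names and the threshold value are hypothetical, and this is not presented as the method of any TDT participant.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def train_topic(sample_stories):
    """Sum the term counts of the Nt on-topic training stories into one centroid."""
    centroid = Counter()
    for story in sample_stories:
        centroid.update(story.lower().split())
    return centroid


def track(sample_stories, test_stories, threshold=0.25):
    """Return one on-topic (True) or off-topic (False) decision per test story."""
    centroid = train_topic(sample_stories)
    return [cosine(centroid, Counter(story.lower().split())) >= threshold
            for story in test_stories]


training = ["hikers lost in avalanche in france",
            "avalanche rescue for lost hikers"]          # Nt = 2
later = ["rescue teams search for hikers after avalanche",
         "parliament votes on budget"]
decisions = track(training, later)
print(decisions)
```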
Topic Tracking Conditions
• 9 Training Conditions (training language sources by number of training stories Nt; E = English, M = Mandarin):

| Nt | English | Mandarin | Both |
|---|---|---|---|
| 1 | 1 (E) | 1 (M) | 1 (E), 1 (M) |
| 2 | 2 (E) | 2 (M) | 2 (E), 2 (M) |
| 4 | 4 (E) | 4 (M) | 4 (E), 4 (M) |

• 1 Language Test Condition: Both English and Mandarin
• 3 Source Conditions:
  – text sources and manual transcription of the audio sources
  – text sources and ASR transcription of the audio sources
  – text sources and the sampled data signal for audio sources
• 2 Story Boundary Conditions:
  – Reference story boundaries provided
  – No story boundaries provided
The Topic Detection Task:
To detect topics in terms of the (clusters of) stories that discuss them.
– Unsupervised topic training: a meta-definition of topic is required, independent of topic specifics.
– New topics must be detected as the incoming stories are processed.
– Input stories are then associated with one of the topics.
[Figure: a cluster of stories labeled “a topic!”]
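
One simple way to realize this unsupervised task is single-pass incremental clustering, sketched below. It assumes bag-of-words centroids, cosine similarity, and a fixed threshold with no decision deferral; the function name `detect_topics` and the threshold are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def detect_topics(stories, threshold=0.3):
    """Assign each story, in arrival order, to an existing or a new topic cluster."""
    centroids = []  # one summed term-count vector per detected topic
    labels = []
    for story in stories:
        vec = Counter(story.lower().split())
        scores = [cosine(centroid, vec) for centroid in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            centroids[best].update(vec)      # story joins the closest topic
            labels.append(best)
        else:
            centroids.append(Counter(vec))   # story starts a new topic
            labels.append(len(centroids) - 1)
    return labels


stories = ["avalanche buries hikers in france",
           "hikers missing after france avalanche",
           "central bank raises interest rates",
           "interest rates rise again says bank"]
labels = detect_topics(stories)
print(labels)
```

The threshold plays the role of the meta-definition of a topic here: it decides how similar a story must be before it counts as discussing an existing topic.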
Topic Detection Conditions
• 3 Language Conditions:
  – English only
  – Mandarin only
  – English and Mandarin together
• 3 Source Conditions:
  – text sources and manual transcription of the audio sources
  – text sources and ASR transcription of the audio sources
  – text sources and the sampled data signal for audio sources
• Decision Deferral Conditions: maximum decision deferral period of 1, 10, or 100 source files
• 2 Story Boundary Conditions:
  – Reference story boundaries provided
  – No story boundaries provided
The First-Story Detection Task:
To detect the first story that discusses a topic, for all topics.
• There is no supervised topic training (like Topic Detection)
[Figure: stories from Topic 1 and Topic 2 along a time axis; the earliest story of each topic is marked “First Stories”, later ones “Not First Stories”]
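
A sketch of first-story detection as a novelty test, assuming each incoming story is compared with every earlier story and flagged as a first story when no earlier story is similar enough. The bag-of-words representation, the threshold, and the name `first_story_flags` are illustrative assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def first_story_flags(stories, threshold=0.3):
    """Return True for each story that resembles no earlier story in the stream."""
    seen = []
    flags = []
    for story in stories:
        vec = Counter(story.lower().split())
        # Novel if every earlier story falls below the similarity threshold.
        flags.append(all(cosine(vec, old) < threshold for old in seen))
        seen.append(vec)
    return flags


stream = ["avalanche buries hikers in france",
          "hikers missing after france avalanche",
          "central bank raises interest rates",
          "interest rates rise again says bank"]
flags = first_story_flags(stream)
print(flags)
```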
First-Story Detection Conditions
• 1 Language Condition: English only
• 3 Source Conditions:
  – text sources and manual transcription of the audio sources
  – text sources and ASR transcription of the audio sources
  – text sources and the sampled data signal for audio sources
• Decision Deferral Conditions: maximum decision deferral period of 1, 10, or 100 source files
• 2 Story Boundary Conditions:
  – Reference story boundaries provided
  – No story boundaries provided
The Link Detection Task
To detect whether a pair of stories discuss the same topic.
• The topic discussed is a free variable.
• Topic definition and annotation is unnecessary.
• The link detection task represents a basic functionality, needed to support all applications (including the TDT applications of topic detection and tracking).
• The link detection task is related to the topic tracking task, with Nt = 1.
[Figure: two stories joined by the question “same topic?”]
Link Detection Conditions
• 1 Language Condition: English only
• 3 Source Conditions:
  – text sources and manual transcription of the audio sources
  – text sources and ASR transcription of the audio sources
  – text sources and the sampled data signal for audio sources
• Decision Deferral Conditions: maximum decision deferral period of 1, 10, or 100 source files
• 1 Story Boundary Condition: Reference story boundaries provided
TDT3 Evaluation Methodology
• All TDT3 tasks are cast as statistical detection (yes-no) tasks.
  – Story Segmentation: Is there a story boundary here?
  – Topic Tracking: Is this story on the given topic?
  – Topic Detection: Is this story in the correct topic-clustered set?
  – First-story Detection: Is this the first story on a topic?
  – Link Detection: Do these two stories discuss the same topic?
• Performance is measured in terms of detection cost, which is a weighted sum of miss and false alarm probabilities:
$$C_{Det} = C_{Miss} \cdot P_{Miss} \cdot P_{target} + C_{FA} \cdot P_{FA} \cdot (1 - P_{target})$$
• Detection cost is normalized so that 0 is a perfect score and 1 is the cost of the better of the two trivial systems (always answering no, or always answering yes):
$$(C_{Det})_{Norm} = \frac{C_{Det}}{\min\{C_{Miss} \cdot P_{target},\; C_{FA} \cdot (1 - P_{target})\}}$$
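
The two formulas translate directly into code. In the sketch below the function and parameter names are my own, and the cost and prior values in the usage lines (`c_miss=1.0`, `c_fa=0.1`, `p_target=0.02`) are illustrative assumptions, not values given on the slides.

```python
def detection_cost(p_miss, p_fa, c_miss, c_fa, p_target):
    """Weighted sum of the miss and false-alarm probabilities (C_Det)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def normalized_detection_cost(p_miss, p_fa, c_miss, c_fa, p_target):
    """C_Det divided by the cost of the better trivial system."""
    trivial = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return detection_cost(p_miss, p_fa, c_miss, c_fa, p_target) / trivial


# Illustrative (assumed) parameters.
params = dict(c_miss=1.0, c_fa=0.1, p_target=0.02)

perfect = normalized_detection_cost(p_miss=0.0, p_fa=0.0, **params)
always_no = normalized_detection_cost(p_miss=1.0, p_fa=0.0, **params)
always_yes = normalized_detection_cost(p_miss=0.0, p_fa=1.0, **params)
print(perfect, always_no, always_yes)
```

With these assumed values, always answering no is the cheaper trivial system, so it sets the normalizer; always answering yes scores above 1, which shows that the normalized cost is not capped at 1 for systems worse than the trivial baseline.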
Example Performance Measures:
[Figure: Tracking Results on Newswire Text (BBN). Normalized Tracking Cost on a log scale from 0.01 to 1, shown for English and Mandarin]
More on TDT
• Some slides from James Allan from the HICSS meeting in January 2005