6. Find & procure a crystal plastic replacement for the polycarbonate LEXAN 943
Main constraints:
• more resistant to detergent agents than LEXAN 943 (which cracks under the combined effect of mechanical stress and exposure to detergent agents)
• compatible with existing tools: shrinkage must be close to that of LEXAN 943
• optical characteristics close to LEXAN 943
• weldable by ultrasonic welding
• compliant with fire & smoke resistance requirements 2 according to NFF16-101/102, and V0 according to standard UL 94
deadline: one week
organization-centric search
7. Where is the SA-24 Grinch (9K338 Igla-S) portable air defense missile system sold or operated?
location-centric search
8. Recent information (past month) about the call for proposals "outils Web innovants en entreprise" ("innovative Web tools in the enterprise")?
time-centric search
14. "In exploratory data analysis of high dimensional data
one of the main tasks is the formation of a
simplified, usually visual, overview of data sets.
....
Clustering and projection
are among the examples of useful methods
to achieve this task."
Fernando Lourenço, Victor Lobo, Fernando Bação: Binary-based similarity measures for categorical data and their
application in self-organizing maps. JOCLAD 2004 - XI Jornadas de Classificação e Análise de Dados, April 1-3, Lisbon (2004)
Lourenço, Lobo, Bação – JOCLAD 2004
15. WebNEM
collection of relevant data, anywhere on the web
+ projection onto the Named Entities space
topical web crawler
named entity recognition
visualization/exploratory analysis tools
16. "Web scale" collection : brute force
never-ending crawl
fast answer, "any" topic
a priori "whole" Web indexing
general index
"everywhere"
huge resources required (data size based)
user query
17. "Web scale" collection : our approach
"close to optimal" resources (usage based)
user query
on-demand topical crawl
delayed answer, but less garbage
tailored index
anywhere
relevant
built on order
Web slices
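The on-demand topical crawl can be illustrated with a toy best-first crawler. This is only a sketch under stated assumptions, not the WebNEM crawler: the in-memory `TOY_WEB` dictionary stands in for the Web, and the `relevance` scoring, the `budget` and the `threshold` parameters are hypothetical names introduced for illustration.

```python
import heapq

def relevance(text, topic_terms):
    """Toy score: fraction of topic terms that occur in the page text."""
    words = set(text.lower().split())
    return sum(1 for t in topic_terms if t in words) / len(topic_terms)

def topical_crawl(web, seeds, topic_terms, budget=10, threshold=0.5):
    """Best-first crawl: fetch the most promising URLs first, keep relevant pages."""
    # Heap entries are (-priority, url); seeds start with the highest priority.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    while frontier and budget > 0:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        budget -= 1  # each fetch consumes one unit of the crawl budget
        score = relevance(text, topic_terms)
        if score >= threshold:
            collected.append(url)
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page that cites it.
                heapq.heappush(frontier, (-score, link))
    return collected

# Hypothetical five-page "Web": url -> (text, outgoing links)
TOY_WEB = {
    "a": ("hydrogen storage for fuel cells", ["b", "c"]),
    "b": ("metal hydride hydrogen storage tanks", ["d"]),
    "c": ("celebrity gossip and recipes", ["e"]),
    "d": ("fuel cells and hydrogen storage materials", []),
    "e": ("more gossip", []),
}

web_slice = topical_crawl(TOY_WEB, ["a"], ["hydrogen", "storage", "fuel"], budget=4)
print(web_slice)
```

With a budget of four fetches the off-topic page "e" is never visited, which is the "usage based" resource argument in miniature: effort goes to links cited by relevant pages.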
18. Projection: when to extract entities?
Named Entity Recognition is resource intensive
process step | data size         | required response time
crawl time   | whole web, 10^10  | asynchronous
crawl time   | web slice, 10^4   | asynchronous
query time   | collection, 10^2  | real-time
19. www.squido.fr
our SaaS Web mining system
large-scale Named Entity extraction (EN/FR)
beta released to customers, June 2011
25. Linguistic processing throughput
deep extraction: too expensive when crawling
shallow extraction: OK, with a penalty on quality
workaround:
asynchronous deep extraction on smaller collections
query-time sanitization
29. & many more…
wrong spelling
Tapei→Taipei
location is also a first name
"University of Michigan, Ann Arbor, MI"→Ann Arbor (person)
compound first names
"Jean-Claude Marin"→Claude Marin
wrong character case (very frequent on titles)
breaks all case-based rules
barrack obama→not extracted
How To Buy Electric Trucks→Buy Electric (organization)
In Virginia Life Is Sweet→Virginia Life (person)
polymorphism
"Nagy Bocsa", "Nagy-Bocsa", "Nagy"
sanitize parser output for tokenization, transliteration, case, punctuation, …
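A query-time sanitization step for the polymorphism case might look like the following. It is a minimal sketch, assuming that variants can be grouped under a normalized key; `entity_key` and `group_variants` are hypothetical helper names, not part of the system described in these slides.

```python
import re
import unicodedata

def entity_key(surface):
    """Normalize a surface form: strip accents, unify punctuation and case."""
    # Transliteration: decompose accented characters and drop combining marks.
    decomposed = unicodedata.normalize("NFKD", surface)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Punctuation: hyphens, dots, commas and underscores become spaces.
    spaced = re.sub(r"[^\w\s]|_", " ", unaccented)
    # Case and whitespace: lowercase and collapse runs of spaces.
    return " ".join(spaced.lower().split())

def group_variants(surfaces):
    """Group surface forms that share the same normalized key."""
    groups = {}
    for s in surfaces:
        groups.setdefault(entity_key(s), []).append(s)
    return groups

variants = ["Nagy Bocsa", "Nagy-Bocsa", "nagy bocsa", "Nagy"]
print(group_variants(variants))
```

Normalization alone merges "Nagy Bocsa" and "Nagy-Bocsa" but leaves the truncated form "Nagy" in its own group; resolving that variant would need context beyond string cleanup.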
31. Reminder
The following results were obtained automatically
from unstructured content picked from the web
by an autonomous system,
without prior knowledge
of the topic or of the visited Web sites
32. Let's try it with a use case
"hydrogen storage for fuel cells"
What's inside a collection of 66 highly ranked documents?
run a few cycles (shallow extraction only)
entity weight function (tf-idf, …)
some 10^4 pages
People, Orgs, Location, Time
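An entity weight function of the tf-idf kind can be sketched as follows. The slides do not give the exact formula used, so this is one common variant (relative term frequency times the natural log of inverse document frequency, summed over the collection); `entity_tfidf` and the sample documents are illustrative assumptions.

```python
import math
from collections import Counter

def entity_tfidf(docs):
    """docs: one list of extracted entity strings per document.
    Returns {entity: weight}, summing tf * idf over the collection."""
    n_docs = len(docs)
    # Document frequency: number of documents mentioning each entity.
    df = Counter(e for doc in docs for e in set(doc))
    weights = Counter()
    for doc in docs:
        if not doc:
            continue  # skip documents with no extracted entity
        for entity, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[entity])
            weights[entity] += tf * idf
    return dict(weights)

# Hypothetical entity lists, one per document
docs = [
    ["Honda", "Austin", "Honda"],
    ["Honda", "Toyota"],
    ["Honda", "Austin"],
]
weights = entity_tfidf(docs)
ranked = sorted(weights, key=weights.get, reverse=True)
print(ranked)
```

An entity present in every document ("Honda" here) gets a zero weight, so the ranking favors entities that distinguish a few documents within the collection.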
38. Cities
"Austin is in a unique position
to offer its electric grid as a
real world proving ground"
"Direct Methanol Fuel Cells"
⇒ alternative to H2
39. changeover from nickel to lithium
will be complete by 2016 and 2018
Multiple-dates timeline
outlook / history
domains
time
Honda President Takanobu Ito says
around 10 percent of Honda’s global sales
will be hybrids by 2015
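One way to build such a multiple-dates timeline is to index each sentence under every year it mentions, then split the entries into history and outlook around a reference year. This is a rough sketch, assuming a plain four-digit year pattern and a hypothetical `timeline` helper; the first sample sentence is made up for illustration, the other two come from the slide.

```python
import re

# Four-digit years from 1900 to 2099 (an assumption; real date parsing is richer).
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

def timeline(sentences, reference_year):
    """Return (history, outlook): sorted (year, sentence) pairs.
    A sentence mentioning several years appears once per year."""
    history, outlook = [], []
    for sentence in sentences:
        for match in YEAR.finditer(sentence):
            year = int(match.group(0))
            target = outlook if year > reference_year else history
            target.append((year, sentence))
    return sorted(history), sorted(outlook)

sentences = [
    "a prototype was shown in 2008",  # made-up example sentence
    "changeover from nickel to lithium will be complete by 2016 and 2018",
    "around 10 percent of Honda's global sales will be hybrids by 2015",
]
history, outlook = timeline(sentences, reference_year=2011)
print([year for year, _ in history], [year for year, _ in outlook])
```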
40. In a few clicks...
hydrogen storage for fuel cells?
DMFC alternative to H2
Austin, TX
changeover from nickel to lithium by 2016/2018
42. To clean or not to clean ?
performance impact / "attention" impact
run pipeline with/without cleaning
corpus: clean set, full set
label examples +/-
time full pipeline
47. Lexical Taxonomies Induction
22nd International Joint Conference on Artificial Intelligence (IJCAI 2011),
Barcelona, Spain, July 19-22nd, 2011
another kind of projection
48. a. A real need for attention-saving…
b. WebNEM results are encouraging
c. Work in progress, lots of paths to explore
6. Digest