Presentation given at the 2nd International Workshop on Web Intelligence & Virtual Enterprises (WIVE'10), held at the 11th IFIP Working Conference on Virtual Enterprises (PRO-VE'10)
http://www.emse.fr/wive/
Cloud based Web Intelligence
1. "How I Learned to Stop Worrying and Love the Bomb"
WIVE 2010
2. An unhinged computer scientist, known
as TimBL, has invented the WWW,
plunging the world into an information
vortex…
Now everybody is fighting to prevent
the knowledge apocalypse...
Recently, Dr. StrangeCloud, a former
mainframe virtualization specialist, has
been called to the rescue…
(story inspired by Kubrick's Dr. Strangelove)
3. Is Dr. Strangecloud going to save the
planet from the ever increasing danger of
death by information overload ?
4.
5. Web Intelligence
within/for VE
Virtual Organiza
resource sharing
Cloud based WI for VE
6. Looking for YACCP ?
Yet Another
Cloud Computing
Presentation ?
You'd better check the specialists instead…
8. Web Intelligence as a research & application field
(COMPSAC 2000, Taiwan)
9. Crossing hot topics
Artificial Intelligence
web mining
semantic web
WEB INTELLIGENCE
web information retrieval
cloud computing (our focus today)
Information Technology
10. Hot, mild or cold ?
(based on Wikipedia article popularity)
cumulated Wikipedia page views, Jan => June 2010
(source : access statistics from wikipedia's squid cluster as compiled by http://stats.grok.se/)
6-month trends for the Wikipedia pages Web Intelligence, Cloud Computing and Lady Gaga
(chart: log scale from 100 to 10 000 000, jan to jun)
Lady Gaga: 11 344 529
Cloud Computing: 1 911 127
Web Intelligence: 1 632
11. But hot local
Web Intelligence
recipe
www.web-intelligence-rhone-alpes.org
12.
13. cloud based domain knowledge
repository enrichment
(use case in FP7 project proposal)
web crawler: millions crawls/month
semantic triple extractor
triple store: 90M triples initially
2.5M manual inputs/year
publication in LOD cloud
14. UKWA
by The British Library
crawl, annotate, preserve
visual analysis & navigation
powered by
IBM BigSheets
on British Library private cloud
(demo on www.webarchive.org.uk/analytics/analytics.htm)
(details on news.cnet.com/8301-13846_3-10459507-62.html)
15. Public Terabyte Dataset
by Bixo Labs
50-200M pages from the 1M top US domains
powered by Hadoop, Bixo, Tika, Avro, Cascading
on AWS cloud (SimpleDB, Elastic MapReduce, S3)
not yet available (09/2010)
big corpus ready for AWS based analysis
(WI research, evaluation, ...)
16.
17. the Web Intelligence paradox
All the Web data is at hand, ready for WI research and applications
2 simple steps :
pick up the data
process it with all those marvelous ML algorithms...
Wait a minute, it's not that simple ! What about :
politeness ?
scale ?
heterogeneity ? (aka "crappiness")
copyright ?
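The politeness question above can be made concrete with a robots.txt check. This is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `wi-demo-bot` user agent and the example.org URLs are assumptions made up for illustration, not taken from the talk.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download it from
# the target host before fetching anything else there.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "wi-demo-bot"  # hypothetical user agent name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before fetching: is this URL allowed for our user agent?
allowed_public = parser.can_fetch(AGENT, "http://example.org/index.html")
allowed_private = parser.can_fetch(AGENT, "http://example.org/private/data.html")

# Delay (in seconds) the site asks crawlers to respect between requests.
delay = parser.crawl_delay(AGENT)

print(allowed_public, allowed_private, delay)
```

Checking permissions is only one part of politeness; the requested delay still has to be enforced per host by the crawler itself.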
18. Use the Semantic Web?
Looking for semantic annotations in 82k web pages
(Squido production systems, 01/2010)
less than 3%
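A count like the one above can be approximated with a small detector. The sketch below is not the Squido measurement: the set of markers it looks for (RDFa attributes, microdata attributes, a few microformat class names) and the sample pages are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Assumed markers of semantic annotation (illustrative, not exhaustive).
RDFA_ATTRS = {"typeof", "property", "about", "vocab"}
MICRODATA_ATTRS = {"itemscope", "itemprop", "itemtype"}
MICROFORMAT_CLASSES = {"vcard", "vevent", "hentry", "hreview"}


class AnnotationDetector(HTMLParser):
    """Sets `annotated` as soon as one start tag carries a known marker."""

    def __init__(self):
        super().__init__()
        self.annotated = False

    def handle_starttag(self, tag, attrs):
        names = {name for name, _ in attrs}
        classes = set()
        for name, value in attrs:
            if name == "class" and value:
                classes.update(value.split())
        if names & (RDFA_ATTRS | MICRODATA_ATTRS) or classes & MICROFORMAT_CLASSES:
            self.annotated = True


def has_semantic_annotations(html: str) -> bool:
    detector = AnnotationDetector()
    detector.feed(html)
    detector.close()
    return detector.annotated


def annotated_share(pages) -> float:
    """Fraction of pages with at least one detected annotation."""
    pages = list(pages)
    if not pages:
        return 0.0
    return sum(has_semantic_annotations(p) for p in pages) / len(pages)


# Hypothetical sample corpus: two annotated pages out of four.
sample_pages = [
    "<html><body><p>plain page</p></body></html>",
    '<div class="vcard"><span class="fn">Jane Doe</span></div>',
    '<div itemscope itemtype="http://schema.org/Person">'
    '<span itemprop="name">Jane</span></div>',
    "<p class='intro'>no markup here</p>",
]
share = annotated_share(sample_pages)
print(share)
```

The result depends heavily on which markers are counted, so any percentage obtained this way is only comparable across runs that use the same marker set.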
19. kind of real world WI process
steps: crawl (millions of pages), clean (ML, ...), process (ML, ...)
needs: dedicated bandwidth, lots of memory, lots of I/O, lots of threads, lots of CPU
21. Load may scale up or
down considerably with crawl size
consider Cloud Computing :
when testing/calibrating
in production, if no crawl limits
22. 1 .45 automatic.
2 boxes of ammunition.
4 days' concentrated emergency rations.
1 drug issue containing antibiotics, morphine, vitamin pills, pep pills, sleeping pills, tranquilizer pills.
1 miniature combination Russian phrase book and Bible.
100 dollars in rubles.
100 dollars in gold.
9 packs of chewing gum.
1 issue of prophylactics.
3 lipsticks.
3 pairs of nylon stockings.
23. Build from other's
Top 10 Lessons Learned from Deploying Hadoop in a Private Cloud
(Rod Cope, OpenLogic's CTO, CloudSlam'10)
24.
25. "Cloud computing is a trap"
warns GNU founder Richard Stallman
"It's stupidity. It's worse than stupidity: it's a marketing hype campaign."
(www.guardian.co.uk/technology/2008/sep/29/cloud.computing.richard.stallman)
=> we can still consider private cloud+OSS
26. web-scale distributed crawl OSS not mature
(Heritrix Cluster Controller build server exception)
Cloud OSS on the rise
(www.blackducksoftware.com/oss/projects/#cloud)
OSS stack for DC/DML under active development
28. Crawling
is the launch pad
in Web Intelligence
Don't take it easy !
Get yourself
a decent crawler
29. Crawling by millions
is not trivial...
many large objects in memory : transient ? persistent ?
www crappiness means endless ugly special cases
customizable revisit policy ?
politeness is challenging
30. DDOS is around the corner
with (poor) cloud based crawling
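One common safeguard against the risk above is a per-host minimum delay between fetches. The sketch below is an assumption about how such a gate could look, not the crawler used in the talk; `PolitenessGate`, the 5 second delay and the example URLs are hypothetical. Time is passed in explicitly so the behaviour is deterministic.

```python
from urllib.parse import urlparse


class PolitenessGate:
    """Allows at most one fetch per host every `min_delay_s` seconds."""

    def __init__(self, min_delay_s: float = 5.0):
        self.min_delay_s = min_delay_s
        self._next_allowed = {}  # host -> earliest time of the next fetch

    def try_acquire(self, url: str, now: float) -> bool:
        """Return True and reserve the host if a fetch is allowed at `now`."""
        host = urlparse(url).netloc.lower()
        if now < self._next_allowed.get(host, 0.0):
            return False  # too early for this host: requeue the URL
        self._next_allowed[host] = now + self.min_delay_s
        return True


gate = PolitenessGate(min_delay_s=5.0)
first = gate.try_acquire("http://example.org/a", now=0.0)       # host is free
too_soon = gate.try_acquire("http://example.org/b", now=1.0)    # same host, 1 s later
other_host = gate.try_acquire("http://example.net/a", now=1.0)  # different host
later = gate.try_acquire("http://example.org/b", now=5.0)       # delay elapsed

print(first, too_soon, other_host, later)
```

In a multi-node cloud crawl this state has to be shared between workers, or each host has to be assigned to a single worker; otherwise every node enforces the delay on its own and the target site still sees the sum of all of them.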
31. Infrastructure is not always key to perfs
Organic effect of politeness on performance :
fetch rate drops over time
(ken-blog.krugler.org)
1,264,539 URLs from
41,978 unique domains
10 slaves cluster
4000 active fetch threads max
brute force
opportunity to scale down !
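The figures on this slide work out to about 30 URLs per domain on average. A toy model hints at why the fetch rate decays: if each domain may be hit only once per politeness interval, the number of useful fetches per interval is capped by the number of domains that still have URLs left, whatever the thread count. The model and the small backlog below are assumptions for illustration, not the measurement from ken-blog.krugler.org.

```python
def fetch_rate_over_time(urls_per_domain, max_threads):
    """Fetches per tick when each domain may be fetched once per tick.

    One tick stands for one politeness interval. Returns the list of
    fetch counts, one entry per tick, until every URL has been fetched.
    """
    remaining = [n for n in urls_per_domain if n > 0]
    rates = []
    while remaining:
        # Serve the domains with the largest backlog first.
        remaining.sort(reverse=True)
        fetched = min(max_threads, len(remaining))
        for i in range(fetched):
            remaining[i] -= 1
        remaining = [n for n in remaining if n > 0]
        rates.append(fetched)
    return rates


# Average backlog per domain, from the figures on the slide.
avg_urls_per_domain = 1_264_539 / 41_978

# Hypothetical skewed backlog: many small domains, one large one.
backlog = [1] * 6 + [3] * 3 + [10]
rates = fetch_rate_over_time(backlog, max_threads=8)
print(rates)
```

Once the small domains are exhausted, most threads sit idle while the few large domains are drained one polite request at a time, which is the point at which the cluster can be scaled down.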
32. a. Cloud Computing is worth considering for WI
b. Have a cloud survival kit
c. Consider private cloud & OSS
d. Compare prices
e. Get yourself a decent crawler
f. Don't turn into DDOS
g. Infrastructure is not always key to perfs
33. "SaaS intelligence on web data, for professionals"
collect, filter, analyse, monitor, share
www.squido.fr
35. Photos:
1. National Nuclear Security Administration/Nevada Site Office
2. Dr. Strangelove/Original film poster by Tomi Ungerer
3. Dr. Strangelove/movie still
4. Dr. Strangelove/movie still
6. cloudslam10.com/Gartner keynote slide, National Institute of Standards and Technology web site screenshot
7. cia.gov/OHB lobby seal picture
8. amazon.com/Computational Web Intelligence book cover
10. Wikimedia Commons/Lady Gaga by petercruise
12. Wikimedia Commons/Operation Crossroads Baker in color.jpg
13. Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
14. flickr/British Library III/jovike, ibm.com/The_British_Library_and_IBM_Bi.jpg
16. Dr. Strangelove/movie still
21. Wikimedia Commons/Castle Bravo Blast.jpg
22. Dr. Strangelove/movie still
23. cloudslam10.com/OpenLogic slide
24. Dr. Strangelove/movie still
25. Wikimedia Commons/RMS iGNUcius techfest iitb.JPG
27. cloudslam10.com/OpenLogic slide
28. Wikimedia Commons/Peacekeeper_missile_after_silo_launch.jpg
31. kkrugler.files.wordpress.com/2009/05/fetch-performance2.png
32. Dr. Strangelove/movie still
Websites:
wikipedia.org
www.emse.fr/wive/
csrc.nist.gov
cloudslam10.com
www.web-intelligence-rhone-alpes.org
stats.grok.se
www.ibm.com/software/ebusiness/jstart/bigsheets
bixolabs.com/datasets/public-terabyte-dataset-project
www.openlogic.com
www.blackducksoftware.com
crawler.archive.org
www.apache.org
twitter.com
ken-blog.krugler.org