The document discusses strategies for researching the deep web and hidden internet. It defines the deep web and lists the alternative names used for it. Key points include that the deep web contains pages not indexed by search engines, dynamically generated content, password-protected sites, and sites that exclude search engine bots. Techniques covered for researching the deep web include using specialized browsers, surgical browsing of high-value sites, leveraging networks, and planning search strategies with targeted sources.
2011 Mining Unique Information Sources & Deep Invisible-Hidden-Opaque Web Recap Final
1. Anna F. Shallenberger
President & “Chief Archer”
Shallenberger Intelligence Services June 14, 2011
anna@targetedknowledge.com
203.258.2383
917.591.6732 fax
www.targetedknowledge.com
http://twitter.com/ClosetLibrarian
http://www.slideshare.net/ClosetLibrarian
http://closetlibrarian.blogspot.com
www.linkedin.com/in/annafayshallenberger
www.ci2020.com/profile/AnnaFShallenberger
http://tinyurl.com/AIIP-AFS
Anna F Shallenberger, All Rights Reserved, for educational use only, not for redistribution or commercial re-use
2. Definitions vary as to what it is / is not
Many names – deep, invisible, hidden, opaque etc
Surface web is “visible” portion
Baseline research, particularly regarding size, is dated
Term coined by Michael Bergman…
3. Pages lower ranked due to Search Engine Optimization [SEO]
Sites coded to exclude bots
Dynamic content generated by page search – stats, etc
Search engine chooses not to cover whole site due to volume of content
Format – video/image w/o text/tags, or they’re incomplete
Site/pages not linked from pages the search engine(s) browse
Password protected pages
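One of the reasons listed above, sites coded to exclude bots, can be checked directly. A minimal sketch, assuming the site publishes a standard robots.txt file; the sample rules, the example.com address, and the function name blocked_paths are invented for illustration and are not from the presentation:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /reports/
Disallow: /members/
"""


def blocked_paths(robots_txt, base_url, paths, agent="*"):
    """Return the paths that this robots.txt tells crawlers not to fetch."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, base_url + p)]


if __name__ == "__main__":
    candidates = ["/index.html", "/reports/2011-q4.pdf", "/members/login"]
    for path in blocked_paths(SAMPLE_ROBOTS_TXT, "https://example.com", candidates):
        print("excluded from crawling:", path)
```

Paths flagged this way are unlikely to appear in search results, so they are candidates for manual browsing. The file only signals what crawlers are asked to skip; it says nothing about what a researcher is permitted to access.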
4. Figure 1. Search Engines: Dragging a Net Across the Web's Surface
5. Q4 NPD Search & Portal Site Study, reported by Search Engine Watch
6. Pearl Grow …
ID Leakage Points [factoring in copyright & other IP concerns]
Non Central Hosts of Content
E.g., content not controlled by “HQ”
Surgical Manual Browsing
Dark Web Browsers [also use pathfinders]
Leverage Your {on & offline} Networks …
7. Conduct Search
Evaluate What You Find…
Mine the most on point for more ideas …
Refine your search strategy,
Continue your investigation,
Repeat process as needed…
8. www.pearltrees.com
9. Look For…
Concepts/Terms/Catch Phrases etc
Names – Experts/Reports/Publications
URL Roots, e.g. are the most relevant hits loaded on the same part of the site?
Revisit strategy adding incremental terms and/or
re-weighting / editing Boolean linkages
Don’t forget the reverse
What are key terms repeating in the “false drops”?
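Two of the checks on this slide lend themselves to a quick tally: which URL roots the best hits share, and which terms keep repeating in the false drops. A rough sketch, assuming the hits have already been saved as lists of URLs and titles; the function names, the stopword list, and the sample data in the usage note are illustrative and are not from the presentation:

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Small illustrative stopword list; extend it for real use.
STOPWORDS = {"the", "a", "an", "of", "and", "for", "in", "to", "on"}


def url_root(url, depth=1):
    """Return the host plus the first `depth` path segments of a URL."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    segments = [s for s in parsed.path.split("/") if s][:depth]
    return "/".join([parsed.netloc] + segments)


def count_roots(urls, depth=1):
    """Tally how many relevant hits share each URL root."""
    return Counter(url_root(u, depth) for u in urls)


def repeated_terms(texts, min_count=2):
    """List terms that appear in at least `min_count` of the given texts."""
    counts = Counter()
    for text in texts:
        # Count each term once per text so a single long page cannot dominate.
        terms = set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS
        counts.update(terms)
    return [(term, n) for term, n in counts.most_common() if n >= min_count]
```

For example, if count_roots shows most good hits sitting under one directory, that directory is worth browsing manually. If repeated_terms(["Deep sea mining report", "Deep sea fishing", "deep web"]) shows "sea" recurring in the false drops, it is a candidate for a NOT clause in the revised Boolean strategy.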
10. [factoring in copyright & other IP concerns]
Analogy: Similar to emotional conversations,
where “speaker” may or may not
Intend for public [or so much of them] to “hear” [have access to] it
Fully comprehend others’ valuation of the information
Understand originator’s perspective – “But I only told that to…
And they promised not to tell anyone…”
Information Leakage Points
Reuse by others – clients, ex employees
Conference Presentations
Case Studies or other Sales & Marketing Collateral
Continuing Education [especially MBA classes]
Social Media
11. Non Central Content Hosts, e. g. content not “HQ” controlled
Branch offices of consultants, research firms, usually non-US
Biz units migrating tech platforms [generally post-merger]
Satellite campuses, larger academic institutions
Event-driven sites – conferences, product introductions, etc…
Non-merger partnerships / joint initiatives
“Relationships” NEC – specialized social networks, non-profits..
[sometimes exec bios NOT pasted verbatim from corp site]
12. Dark Web / Specialized Browsers, etc
13. Surgical Browsing
Identify Potential High-Value Sites
Navigate Manually
Create Your Own Site Index Using a Browser
It can include
Downloading {smaller} sites into Adobe to browse offline
Looking for cross-linking to site, especially several layers in
Locating historical content in caches or archiving sites
Be Careful
While limiting searches by doc type [pdf etc] is effective,
Searchable layers can mask them behind other file types
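The idea of building your own site index, and the caution about file types, can be illustrated with a small script. A sketch only, assuming a page's HTML has already been saved locally; the class and function names, the sample markup, and the example.com address are hypothetical and are not from the presentation:

```python
from collections import defaultdict
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, href))


def index_by_extension(html, base_url):
    """Group a page's links by the file extension visible in each URL."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    index = defaultdict(list)
    for link in collector.links:
        ext = PurePosixPath(urlparse(link).path).suffix.lower() or "(none)"
        index[ext].append(link)
    return dict(index)


# Hypothetical page fragment, invented for illustration only.
SAMPLE_HTML = """
<a href="/reports/annual.pdf">Annual report</a>
<a href="viewer.aspx?doc_id=42">Q4 study</a>
<a href="/about/">About</a>
"""

if __name__ == "__main__":
    index = index_by_extension(SAMPLE_HTML, "http://example.com/library/")
    for ext, links in sorted(index.items()):
        print(ext, links)
```

In this sample the "Q4 study" link is filed under .aspx even though it may well deliver a PDF, which is the masking effect the slide warns about: a search limited to pdf would miss it, while a manual pass through the index would not.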
14. Leveraging Who You Know, Digitally & Offline
What do people in that field read, on & offline ?
What would they consider a waste of time?
A large part of the challenge is indexing…
And you need to ID what they “miss”
Sometimes there’s no GPS, must already know
where you’re going, or at least a mid-point…
15. Needles in Haystacks aren’t invisible,
but they can be more work to locate
Some even hide in plain sight
Have a plan, but flex it as needed
Take good notes, bookmark good leads, save best hits
Might not find them again or they change
ALWAYS Consider the Source
Manage time spent, don’t get lost
Be Flexible, but still…
Plan Ahead
16. Planning Ahead
*AFS example based on model designed by KnowledgeInform
Based On Questions You
Are Seeking To Answer
ID Potential Sources, &
“Pearl Grow” From There
17. ID/Target Sources
{categorization subjective – many fit multiple}
Influencers
Consultants / Think Tanks
Pollsters / Market Researchers
Academia
Governmental
NGOs & Advocacy Groups
Trade/Professional Associations
Other Niche Organizations
Businesses & Publishers, NEC
Aggregators/Re-packagers/Peer-Sharing , NEC
PEOPLE NEC
18. Deep Web: Surfacing Hidden Value
www.brightplanet.com/images/uploads/DeepWebWhitePaper_20091015.pdf
Marcus Zillman: Deep Web Research
www.llrx.com/features/deepweb2011.htm or www.deepwebresearch.info
Chris Sherman @ Information Online
www.docstoc.com/Docs/Document-Detail-14.aspx?doc_id=84592274
August Jackson [for SCIP] Getting Most ….
http://homepage.mac.com/cornfed/internetdeepweb.pdf
Using Web Investigative Reporting Tool
www.slideshare.net/tccj/web-as-investigative-tool or
http://campuscoverage.org/sites/default/files/Docs/Presentations/CCPInternet.ppt
Model & Analyze Deep Web
www.scribd.com/doc/59496007/Modeling-and-Analyze-the-Deep-Web-Surfacing-Hidden-Value
Accurate & Efficient Crawling Deep Web
www.scribd.com/doc/57147960/Accurate-And-Efficient-Crawling-The-Deep-Web-Surfacing-Hidden-Value
Web & Twitter Archiving @ Library of Congress
www.slideshare.net/nullhandle/web-and-twitter-archiving-at-the-library-of-congress
CRS report to Congress
www.docstoc.com/docs/84024621/CRS-Report-for-Congress
How Much Information [UC Berkeley]
http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/printable_report.pdf
19. No member of a crew is praised for the rugged individuality of his rowing.
~Ralph Waldo Emerson
Thanks & Best Regards,
Anna F. Shallenberger
203.258.2383 cell
917.591.6732 fax
anna@targetedknowledge.com
http://twitter.com/ClosetLibrarian
http://www.slideshare.net/ClosetLibrarian
http://closetlibrarian.blogspot.com
An experienced researcher, educator, author, blogger, strategist & consultant,
Anna Shallenberger, aka the ClosetLibrarian, was recently recognized in Best of
the Business Web & featured on SlideShare’s home page.
At SLA 2011, Anna was a panelist for “Integrating with Sales & Marketing to
Capture & Deliver Intelligence” & led an “Intelligence Café” discussion regarding
Unique Information Sources & the Deep Web. She was also a spotlight panelist
@ SLA 2010 & served as conference planner for the CI Division.