SlideShare a Scribd company logo
1 of 25
Download to read offline
Importing Wikipedia in Plone
Eric BREHAULT – Plone Conference 2013
ZODB is good at storing objects
● Plone contents are objects,
● we store them in the ZODB,
● everything is fine, end of the story.
But what if ...
... we want to store non-contentish records?
Like polls, statistics, mail-list subscribers, etc.,
or any business-specific structured data.
Store them as contents anyway
That is a powerfull solution.
But there are 2 major problems...
Problem 1: You need to manage a secondary
system
● you need to deploy it,
● you need to backup it,
● you need to secure it,
● etc.
Problem 2: I hate SQL
No explanation here.
I think I just cannot digest it...
How to store many records in the ZODB?
● Is the ZODB strong enough?
● Is the ZCatalog strong enough?
My grandmother often told me
"If you want to become stronger, you have to eat your soup."
Where do we find a good soup for Plone?
In a super souper!!!
souper.plone and souper
● It provides both storage and indexing.
● Record can store any persistent pickable data.
● Created by BlueDynamics.
● Based on ZODB BTrees, node.ext.zodb, and repoze.catalog.
Add a record
>>> soup = get_soup('mysoup', context)
>>> record = Record()
>>> record.attrs['user'] = 'user1'
>>> record.attrs['text'] = u'foo bar baz'
>>> record.attrs['keywords'] = [u'1', u'2', u'ü']
>>> record_id = soup.add(record)
Record in record
>>> record['homeaddress'] = Record()
>>> record['homeaddress'].attrs['zip'] = '6020'
>>> record['homeaddress'].attrs['town'] = 'Innsbruck'
>>> record['homeaddress'].attrs['country'] = 'Austria'
Access record
>>> from souper.soup import get_soup
>>> soup = get_soup('mysoup', context)
>>> record = soup.get(record_id)
Query
>>> from repoze.catalog.query import Eq, Contains
>>> [r for r in soup.query(Eq('user', 'user1')
& Contains('text', 'foo'))]
[<Record object 'None' at ...>]
or using CQE format
>>> [r for r in soup.query("user == 'user1' and 'foo' in text")]
[<Record object 'None' at ...>]
souper
● a Soup-container can be moved to a specific ZODB mount-
point,
● it can be shared across multiple independent Plone instances,
● souper works on Plone and Pyramid.
Plomino & souper
● we use Plomino to build non-content oriented apps easily,
● we use souper to store huge amount of application data.
Plomino data storage
Originally, documents (=record) were ATFolder.
Capacity about 30 000.
Plomino data storage
Since 1.14, documents are pure CMF.
Capacity about 100 000.
Usally the Plomino ZCatalog contains a lot of indexes.
Plomino & souper
With souper, documents are just soup records.
Capacity: several millions.
Typical use case
● Store 500 000 addresses,
● Be able to query them in full text and display the result on a map.
Demo
What is the limit?
Can we import Wikipedia in souper?
Demo with 400 000 records
Demo with 5,5 millions of records
Conclusion
● Usage performances are good,
● Plone performances are not impacted.
Use it!
Thoughts
● What about a REST API on top of it?
● Massive import is long and difficult, could it be improved?
Makina Corpus
For all questions related to this talk,
please contact Éric Bréhault
eric.brehault@makina-corpus.com
Tel : +33 534 566 958
www.makina-corpus.com

More Related Content

Similar to Importing Large Datasets in Plone with Souper

Centralized logging system using mongoDB
Centralized logging system using mongoDBCentralized logging system using mongoDB
Centralized logging system using mongoDBVivek Parihar
 
Async IO and Multithreading explained
Async IO and Multithreading explainedAsync IO and Multithreading explained
Async IO and Multithreading explainedDirecti Group
 
Buildout: creating and deploying repeatable applications in python
Buildout: creating and deploying repeatable applications in pythonBuildout: creating and deploying repeatable applications in python
Buildout: creating and deploying repeatable applications in pythonCodeSyntax
 
python-160403194316.pdf
python-160403194316.pdfpython-160403194316.pdf
python-160403194316.pdfgmadhu8
 
MongoDB and AWS Best Practices
MongoDB and AWS Best PracticesMongoDB and AWS Best Practices
MongoDB and AWS Best PracticesMongoDB
 
Python Seminar PPT
Python Seminar PPTPython Seminar PPT
Python Seminar PPTShivam Gupta
 
Running a Plone product on Substance D
Running a Plone product on Substance DRunning a Plone product on Substance D
Running a Plone product on Substance DMakina Corpus
 
A novel approach to Undo
A novel approach to UndoA novel approach to Undo
A novel approach to UndoSean Upton
 
Linux multiplexing
Linux multiplexingLinux multiplexing
Linux multiplexingMark Veltzer
 
GSoC2014 - PGCon2015 Presentation June, 2015
GSoC2014 - PGCon2015 Presentation June, 2015GSoC2014 - PGCon2015 Presentation June, 2015
GSoC2014 - PGCon2015 Presentation June, 2015Fabrízio Mello
 
Python_Introduction&DataType.pptx
Python_Introduction&DataType.pptxPython_Introduction&DataType.pptx
Python_Introduction&DataType.pptxHaythamBarakeh1
 
Prototype4Production Presented at FOSSASIA2015 at Singapore
Prototype4Production Presented at FOSSASIA2015 at SingaporePrototype4Production Presented at FOSSASIA2015 at Singapore
Prototype4Production Presented at FOSSASIA2015 at SingaporeDhruv Gohil
 
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...GeeksLab Odessa
 
python presntation 2.pptx
python presntation 2.pptxpython presntation 2.pptx
python presntation 2.pptxArpittripathi45
 

Similar to Importing Large Datasets in Plone with Souper (20)

Centralized logging system using mongoDB
Centralized logging system using mongoDBCentralized logging system using mongoDB
Centralized logging system using mongoDB
 
Async IO and Multithreading explained
Async IO and Multithreading explainedAsync IO and Multithreading explained
Async IO and Multithreading explained
 
Buildout: creating and deploying repeatable applications in python
Buildout: creating and deploying repeatable applications in pythonBuildout: creating and deploying repeatable applications in python
Buildout: creating and deploying repeatable applications in python
 
python-160403194316.pdf
python-160403194316.pdfpython-160403194316.pdf
python-160403194316.pdf
 
MongoDB and AWS Best Practices
MongoDB and AWS Best PracticesMongoDB and AWS Best Practices
MongoDB and AWS Best Practices
 
Data analysis with pandas
Data analysis with pandasData analysis with pandas
Data analysis with pandas
 
Data Analysis With Pandas
Data Analysis With PandasData Analysis With Pandas
Data Analysis With Pandas
 
Python Seminar PPT
Python Seminar PPTPython Seminar PPT
Python Seminar PPT
 
Python
PythonPython
Python
 
python into.pptx
python into.pptxpython into.pptx
python into.pptx
 
Running a Plone product on Substance D
Running a Plone product on Substance DRunning a Plone product on Substance D
Running a Plone product on Substance D
 
A novel approach to Undo
A novel approach to UndoA novel approach to Undo
A novel approach to Undo
 
Linux multiplexing
Linux multiplexingLinux multiplexing
Linux multiplexing
 
GSoC2014 - PGCon2015 Presentation June, 2015
GSoC2014 - PGCon2015 Presentation June, 2015GSoC2014 - PGCon2015 Presentation June, 2015
GSoC2014 - PGCon2015 Presentation June, 2015
 
Python_Introduction&DataType.pptx
Python_Introduction&DataType.pptxPython_Introduction&DataType.pptx
Python_Introduction&DataType.pptx
 
Prototype4Production Presented at FOSSASIA2015 at Singapore
Prototype4Production Presented at FOSSASIA2015 at SingaporePrototype4Production Presented at FOSSASIA2015 at Singapore
Prototype4Production Presented at FOSSASIA2015 at Singapore
 
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...
 
python presntation 2.pptx
python presntation 2.pptxpython presntation 2.pptx
python presntation 2.pptx
 
Conf orm - explain
Conf orm - explainConf orm - explain
Conf orm - explain
 
Plone on RelStorage
Plone on RelStoragePlone on RelStorage
Plone on RelStorage
 

Recently uploaded

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 

Recently uploaded (20)

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

Importing Large Datasets in Plone with Souper

  • 1. Importing Wikipedia in Plone Eric BREHAULT – Plone Conference 2013
  • 2. ZODB is good at storing objects ● Plone contents are objects, ● we store them in the ZODB, ● everything is fine, end of the story.
  • 3. But what if ... ... we want to store non-contentish records? Like polls, statistics, mail-list subscribers, etc., or any business-specific structured data.
  • 4. Store them as contents anyway That is a powerfull solution. But there are 2 major problems...
  • 5. Problem 1: You need to manage a secondary system ● you need to deploy it, ● you need to backup it, ● you need to secure it, ● etc.
  • 6. Problem 2: I hate SQL No explanation here.
  • 7. I think I just cannot digest it...
  • 8. How to store many records in the ZODB? ● Is the ZODB strong enough? ● Is the ZCatalog strong enough?
  • 9. My grandmother often told me "If you want to become stronger, you have to eat your soup."
  • 10. Where do we find a good soup for Plone? In a super souper!!!
  • 11. souper.plone and souper ● It provides both storage and indexing. ● Record can store any persistent pickable data. ● Created by BlueDynamics. ● Based on ZODB BTrees, node.ext.zodb, and repoze.catalog.
  • 12. Add a record >>> soup = get_soup('mysoup', context) >>> record = Record() >>> record.attrs['user'] = 'user1' >>> record.attrs['text'] = u'foo bar baz' >>> record.attrs['keywords'] = [u'1', u'2', u'ü'] >>> record_id = soup.add(record)
  • 13. Record in record >>> record['homeaddress'] = Record() >>> record['homeaddress'].attrs['zip'] = '6020' >>> record['homeaddress'].attrs['town'] = 'Innsbruck' >>> record['homeaddress'].attrs['country'] = 'Austria'
  • 14. Access record >>> from souper.soup import get_soup >>> soup = get_soup('mysoup', context) >>> record = soup.get(record_id)
  • 15. Query >>> from repoze.catalog.query import Eq, Contains >>> [r for r in soup.query(Eq('user', 'user1') & Contains('text', 'foo'))] [<Record object 'None' at ...>] or using CQE format >>> [r for r in soup.query("user == 'user1' and 'foo' in text")] [<Record object 'None' at ...>]
  • 16. souper ● a Soup-container can be moved to a specific ZODB mount- point, ● it can be shared across multiple independent Plone instances, ● souper works on Plone and Pyramid.
  • 17. Plomino & souper ● we use Plomino to build non-content oriented apps easily, ● we use souper to store huge amount of application data.
  • 18. Plomino data storage Originally, documents (=record) were ATFolder. Capacity about 30 000.
  • 19. Plomino data storage Since 1.14, documents are pure CMF. Capacity about 100 000. Usally the Plomino ZCatalog contains a lot of indexes.
  • 20. Plomino & souper With souper, documents are just soup records. Capacity: several millions.
  • 21. Typical use case ● Store 500 000 addresses, ● Be able to query them in full text and display the result on a map. Demo
  • 22. What is the limit? Can we import Wikipedia in souper? Demo with 400 000 records Demo with 5,5 millions of records
  • 23. Conclusion ● Usage performances are good, ● Plone performances are not impacted. Use it!
  • 24. Thoughts ● What about a REST API on top of it? ● Massive import is long and difficult, could it be improved?
  • 25. Makina Corpus For all questions related to this talk, please contact Éric Bréhault eric.brehault@makina-corpus.com Tel : +33 534 566 958 www.makina-corpus.com