"Tailor Made Concordancer: (Semi-) Big Data Corpora and Flexible Open Source Software", Tobias Gärtner, Consultant Big Data at Advisori FTC GmbH
4. The Writing Centre Triangle
www.advisori.de
[Diagram: a triangle connecting Student, Scientific Instructor and Writing Instructor. One link is marked "no communication". Two boxes labelled "Missing knowledge" each list: Content, Linguistics, Academic Traditions.]
5. A Mismatch in Communication
• A Chinese student of mechanical engineering writing a bachelor's thesis in German: no idea of German metalanguage & German academic traditions
• A German language instructor with a master's degree in social sciences: no idea of mechanical engineering in terms of content & academic traditions
6. An Example
• Which verb goes together with “regression”?
a. Fit
b. Estimate
c. Calculate
d. Predict
e. Compute
f. I-hope-it-is-not-contagious
7. Solution Strategies
• Ask a dictionary: There are no specialised dictionaries
• Ask Google: How would you?
• Ask the student: She/he does not know
• Ask someone else: Your colleagues know as much as you know
• Have a look at the relevant literature: A good starting point
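Looking at the literature can be made systematic by counting, sentence by sentence, which candidate verbs occur together with the noun in question. The following is only a minimal sketch and not the tool presented in this talk: the three sample sentences are invented placeholder data, `count_cooccurring_verbs` is a hypothetical helper, and word forms are matched by a crude stem prefix instead of proper lemmatisation.

```python
import re
from collections import Counter

CANDIDATES = ["fit", "estimate", "calculate", "predict", "compute"]

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def count_cooccurring_verbs(text, noun="regression", candidates=CANDIDATES):
    counts = Counter()
    for sentence in split_sentences(text):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        if noun not in tokens:
            continue
        for verb in candidates:
            # Crude stand-in for lemmatisation: match word forms by stem prefix.
            stem = verb.rstrip("e")
            if any(tok.startswith(stem) for tok in tokens):
                counts[verb] += 1
    return counts

# Invented placeholder text; a real run would read papers from the student's field.
sample = (
    "We fitted a linear regression to the data. "
    "The regression was estimated by least squares. "
    "A regression model was fitted for each group."
)
print(count_cooccurring_verbs(sample))
```

On real literature the counts would only be meaningful with a proper tokeniser and lemmatiser for the language at hand.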
10. The Task
Design, programme and implement a tool that helps language instructors working at writing centres to support students writing in a foreign language.
11. Some Challenges
• No one wants to use a programme with such a syntax:
• [a-z]*[vbp]\s[a-z\s]*\sregression[a-z]
• Sentence boundaries need to be respected
• It needs to run online, offline, on Windows, Windows Server, Linux, Linux servers and Mac (hey, why not on a smartphone as well?)
• It needs to be easily maintainable
• It needs to return high-quality results without being too techy in its IT and linguistic terminology
• It needs to be cheap (i.e. free)
• It needs to work with German, English and Russian texts
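One way to address the first two challenges is to let users type plain words and build the pattern internally, while searching sentence by sentence. The following is a minimal sketch under assumptions: `build_pattern` and `search_sentences` are hypothetical names, the sentence splitter is naive, and only word forms (no POS tags or lemmata) are handled.

```python
import re

def build_pattern(first, second, max_gap=3):
    # Match `first`, then up to `max_gap` intervening words, then `second`.
    gap = r"(?:\W+\w+){0,%d}?" % max_gap
    return re.compile(
        r"\b" + re.escape(first) + r"\w*" + gap + r"\W+" + re.escape(second) + r"\w*",
        re.IGNORECASE,
    )

def search_sentences(text, first, second, max_gap=3):
    pattern = build_pattern(first, second, max_gap)
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Searching sentence by sentence keeps matches from crossing boundaries.
    return [m.group(0) for s in sentences for m in [pattern.search(s)] if m]

text = "We fit a regression. They fit it. Regression is fun."
print(search_sentences(text, "fit", "regression"))
# ['fit a regression']
```

Applied to the whole text at once, the same pattern would also match "fit it. Regression", which is why the search runs per sentence.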
15. Query Input and Programme Output
Query Input: Words, Lemmata, POS Tags (of each up to 5); One Corpus or Two Corpora
Output: KWIC, Collocations, N-Grams, Readings, LSA Associations, Frequencies, Complexity
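Two of the listed outputs, KWIC (keyword in context) lines and n-gram counts, can be illustrated in a few lines. This is a minimal sketch and not the code of the tool itself: the function names `kwic` and `ngrams`, the context width and the sample sentence are assumptions made for illustration.

```python
from collections import Counter

def kwic(tokens, keyword, width=3):
    # Return (left context, keyword, right context) for every hit.
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def ngrams(tokens, n=2):
    # Count all contiguous n-token sequences.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Invented sample sentence, whitespace-tokenised for simplicity.
tokens = "we fit a regression and then fit a second regression".split()
for left, key, right in kwic(tokens, "regression"):
    print(f"{left:>20} [{key}] {right}")
print(ngrams(tokens, 2).most_common(1))
```

A query on lemmata or POS tags would work the same way once each token carries those annotations alongside its word form.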
16. Contact Details
Feel free to contact me:
Via E-Mail: tobias.gaertner@advisori.de
On Xing: https://www.xing.com/profile/Tobias_Gaertner35
On LinkedIn: https://www.linkedin.com/in/tobias-g%C3%A4rtner-b11205125/
Did you know we are hiring?