Big Data Poolza Talk: Aspects of Semantic Processing


Published on February 23, 2014

Author: bokononisms

Source: slideshare.net

Description

This inaugural Meetup talk, sponsored by the Knowledgent Group, discussed aspects of semantic processing and emphasized using Python for lexical semantics. The slides include example code snippets for computing relationships between words with the Natural Language Toolkit (NLTK) in Python. There is also a brief overview of the technologies underlying the Semantic Web and text mining.

Knowledgent Big Data-palooza: Aspects of Semantic Processing
Na’im R. Tyson, PhD
February 6, 2014

Discussion Topics
• Semantic Processing
  – What is Semantics?
  – What is Pragmatics?
• Lexical Semantics
  – Computing Semantic Similarity
    ∗ WordNet
    ∗ Vector Space Modeling
• Ontology Basics
• Text Mining: Basics

Semantic Processing
• What is Semantics?
  – Study of the literal meanings of words and sentences
    ∗ Lexical semantics: word meanings and word relations
  – Sometimes stated formally using a logical form
    ∗ Example: ∀x ∃y loves(x, y) ("everyone loves someone")
• What is Pragmatics?
  – Study of language use and its situational contexts (discourse, deixis, presupposition, etc.)

Lexical Semantics
WordNet: Description
• Word relation database
• Created by George Miller & Christiane Fellbaum (Miller, 1995; Fellbaum, 1998) at Princeton University
• Types of relationships
  Synonymy - word pair similarity
  Antonymy - word pair dissimilarity
  Meronymy - part-of relation
    – Example: 'engine' and 'car'
  Hyponymy - subordinate relation between words (i.e., a type-of relation)
    – Example: 'red' is a hyponym of 'color' ('red' is a type of color)
  Hypernymy - superordinate relation between words
    – Example: 'color' is a hypernym of 'red'
  Question: What's the relationship between a hyponym and a hypernym?
• 150k words with 115k synsets and approx. 200k word-sense pairs

Lexical Semantics
• Adapted from Python Text Processing with NLTK 2.0 Cookbook (Perkins, 2010)
>>> # Requires the WordNet corpus, e.g. via nltk.download('wordnet')
>>> from nltk.corpus import wordnet as wn
>>> word_synset = wn.synsets('cookbook')[0]
>>> word_synset.name()  # a method in NLTK 3; it was an attribute in NLTK 2.0
'cookbook.n.01'
>>> word_synset.definition()
'a book of recipes and cooking directions'

Lexical Semantics
• Antonymy:
>>> ga1 = wn.synset('good.a.01')
>>> ga1.definition()
'having desirable or positive qualities especially those suitable for a thing specified'
>>> bad = ga1.lemmas()[0].antonyms()[0]  # antonyms are defined on lemmas, not synsets
>>> bad.name()
'bad'
>>> bad.synset().definition()
'having undesirable or negative qualities'

Lexical Semantics
• Hyponymy & Hypernymy:
>>> word_synset.hyponyms()   # more specific synsets
>>> word_synset.hypernyms()  # more general synsets

Computing Similarity by WordNet
• Similarity by Path Length (see Perkins, 2010, p. 19)
>>> from nltk.corpus import wordnet as wn
>>> cb = wn.synset('cookbook.n.01')
>>> ib = wn.synset('instruction_book.n.01')
>>> cb.wup_similarity(ib)  # Wu-Palmer Similarity
0.9166666666666666
• For path similarity explanations, see Jaganadhg (2010)
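The Wu-Palmer score can be illustrated without WordNet. The following is a minimal sketch, assuming a tiny hand-made hypernym tree: the node names and the `PARENT` mapping are hypothetical and are not WordNet's actual hierarchy. It computes 2 × depth(LCS) / (depth(a) + depth(b)), where LCS is the lowest common subsumer and the root counts as depth 1. NLTK's own implementation handles details (such as multiple hypernym paths) that this sketch ignores.

```python
# Toy hypernym tree: child -> parent. Hypothetical data, not WordNet's real hierarchy.
PARENT = {
    "publication": "entity",
    "book": "publication",
    "magazine": "publication",
    "cookbook": "book",
    "instruction_book": "book",
}


def path_to_root(node):
    """Return the list [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node))


def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward, so the first hit is the deepest
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes do not share a root")


def wup_similarity(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wup_similarity("cookbook", "instruction_book"))  # siblings under 'book'
print(wup_similarity("cookbook", "magazine"))          # related only via 'publication'
```

In this toy tree the siblings under 'book' score higher than the pair that only meets at 'publication', which is the behavior the measure is designed to capture.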

Advantages & Disadvantages
• Advantages
  Quality: developed and maintained by researchers
  Practice: applications can use WordNet
  Software: SenseRelate (Perl) - http://senserelate.sourceforge.net
• Disadvantages
  Coverage: technical terms may be missing
  Irregularity: path lengths can be irregular across hierarchies
  Relatedness: related terms may not be in the same hierarchies
    Example: Tennis Problem: 'player', 'racquet', 'ball' and 'net'

Computing Word Similarity by Vector Space Modeling
• Computing Similarity from a Document Corpus
  Goal: determine distributional properties of a word
  Steps: In general...
    – Create a vector of size n for each word of interest
    – Think of the vectors as points in some n-dimensional space
    – Use a similarity metric to compute distance
  Algorithm: Brown et al. (1992)
    – C(x) - vector with properties of x (context of 'x')
    – C(w) = ⟨#(w1), #(w2), ..., #(wk)⟩, where #(wi) is the number of times wi followed w in a corpus
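The context vector C(w) described above can be sketched in a few lines. This is an illustrative sketch only: the toy corpus and the helper names `context_vectors` and `as_vector` are assumptions, and it implements just the "count the words that follow w" step, not the full class-based model of Brown et al. (1992).

```python
from collections import Counter, defaultdict


def context_vectors(tokens):
    """Map each word w to a Counter of the words that immediately follow w."""
    following = defaultdict(Counter)
    for word, next_word in zip(tokens, tokens[1:]):
        following[word][next_word] += 1
    return following


def as_vector(counter, vocabulary):
    """Lay a Counter out as a fixed-order count vector over the vocabulary."""
    return [counter[word] for word in vocabulary]


# Hypothetical toy corpus, already tokenized and lowercased.
corpus = "the cat sat on the mat the cat ate the fish".split()
vectors = context_vectors(corpus)
vocabulary = sorted(set(corpus))

print(vocabulary)
print(as_vector(vectors["the"], vocabulary))
```

Fixing the vocabulary order means every word's vector has the same dimensions, so any two vectors can be compared with a distance or similarity metric.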


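Similarity Measure: Cosine

$$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}, \qquad \vec{x} = (x_1, \ldots, x_n)$$

                cosmonaut  astronaut  moon  car  truck
Soviet              1          0       0     1     1
American            0          1       0     1     1
spacewalking        1          1       0     0     0
red                 0          0       0     1     1
full                0          0       1     0     0
old                 0          0       0     1     1

$$\cos(\text{cosm}, \text{astr}) = \frac{1 \cdot 0 + 0 \cdot 1 + 1 \cdot 1 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0}{\sqrt{1^2 + 0^2 + 1^2 + 0^2 + 0^2 + 0^2}\,\sqrt{0^2 + 1^2 + 1^2 + 0^2 + 0^2 + 0^2}} = \frac{1}{\sqrt{2}\,\sqrt{2}} = 0.5$$

Figure 1: Cosine Similarity Comparison from Collins (2007)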


Similarity Measure: Euclidean

$$|\vec{x}, \vec{y}| = |\vec{x} - \vec{y}| = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

                cosmonaut  astronaut  moon  car  truck
Soviet              1          0       0     1     1
American            0          1       0     1     1
spacewalking        1          1       0     0     0
red                 0          0       0     1     1
full                0          0       1     0     0
old                 0          0       0     1     1

$$\text{euclidean}(\text{cosm}, \text{astr}) = \sqrt{(1 - 0)^2 + (0 - 1)^2 + (1 - 1)^2 + (0 - 0)^2 + (0 - 0)^2 + (0 - 0)^2} = \sqrt{2} \approx 1.414$$

Figure 2: Euclidean Similarity Comparison from Collins (2007)

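Cosine & Euclidean Similarity in Python
>>> import numpy as np
>>> from scipy.spatial import distance as dist
>>> cosm = np.array([1, 0, 1, 0, 0, 0])
>>> astr = np.array([0, 1, 1, 0, 0, 0])
>>> float(dist.cosine(cosm, astr))  # SciPy returns the cosine *distance*, 1 - cos
0.5
>>> float(1 - dist.cosine(cosm, astr))  # cosine similarity
0.5
>>> float(dist.euclidean(cosm, astr))  # sqrt(2)
1.4142135623730951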
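The same two measures can be applied to every column of the word/context table. The following is a dependency-free sketch using only the standard library; the function names and the `VECTORS` dictionary are illustrative, and the numbers are copied from the table shown in Figures 1 and 2.

```python
import math

# Columns of the word/context table from the figures (Collins, 2007).
# Context order: Soviet, American, spacewalking, red, full, old.
VECTORS = {
    "cosmonaut": [1, 0, 1, 0, 0, 0],
    "astronaut": [0, 1, 1, 0, 0, 0],
    "moon":      [0, 0, 0, 0, 1, 0],
    "car":       [1, 1, 0, 1, 0, 1],
    "truck":     [1, 1, 0, 1, 0, 1],
}


def cosine_similarity(x, y):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


for w1, w2 in [("cosmonaut", "astronaut"), ("car", "truck"), ("cosmonaut", "moon")]:
    cos = cosine_similarity(VECTORS[w1], VECTORS[w2])
    euc = euclidean_distance(VECTORS[w1], VECTORS[w2])
    print(f"{w1:>9} vs {w2:<9} cosine={cos:.3f} euclidean={euc:.3f}")
```

Because 'car' and 'truck' have identical columns, their cosine similarity is 1 and their Euclidean distance is 0, while 'cosmonaut' and 'moon' share no context and have cosine similarity 0.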

Computing Word Similarity by Vector Space Modeling
• Advantages & Disadvantages
  – Advantage: requires no database lookups
  – Disadvantage: semantic similarity doesn't imply synonymy, antonymy, meronymy, hyponymy, hypernymy, etc.

Ontology Basics
• Semantic Web Technologies
  – Data Models
  – Ontology Language
  – Distributed Query Language
  – Applications
    ∗ Large knowledge bases
    ∗ Business Intelligence

Ontology Basics
Figure 3: Cambridge Semantics’ simplified view of Semantic Web solutions.

Ontology Basics
• W3C Semantic Web
  – RDF - Resource Description Framework
    ∗ Data model with identifiers and named relations between resource pairs
    ∗ Represented as directed graphs between resources and literal values
      · Done with collections of triples
      · triple: subject, predicate and object
        1. Na’im Tyson born in 197x
        2. Na’im Tyson works for Knowledgent
        3. Knowledgent headquartered in Warren
  – SPARQL - SPARQL Protocol And RDF Query Language
    ∗ Query language of the Semantic Web
    ∗ Queries RDF stores over HTTP
    ∗ Very similar to SQL
  – Capturing Relationships
    RDF Schema: Vocabulary (term definitions), Schema (class definitions) and Taxonomies (defining hierarchies)
    OWL: Expressive relation definitions (symmetry, transitivity, etc.)
    RIF: Rule Interchange Format - representation for exchanging sets of logical and business rules
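To make the triple model concrete, here is a minimal sketch that stores the three example triples as plain Python tuples and matches patterns against them, loosely in the spirit of a SPARQL basic graph pattern. It is an assumption-laden illustration, not an RDF or SPARQL implementation: real RDF identifies resources and predicates with IRIs, and the helper names `match` and `employer_headquarters` are hypothetical.

```python
# The three example triples from the slide, as (subject, predicate, object).
# Plain strings stand in for the IRIs that real RDF would use.
TRIPLES = [
    ("Na'im Tyson", "born in", "197x"),
    ("Na'im Tyson", "works for", "Knowledgent"),
    ("Knowledgent", "headquartered in", "Warren"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard variable."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def employer_headquarters(triples, person):
    """Join two patterns: where is the employer of `person` headquartered?"""
    places = []
    for _, _, employer in match(triples, subject=person, predicate="works for"):
        for _, _, place in match(triples, subject=employer, predicate="headquartered in"):
            places.append(place)
    return places


print(match(TRIPLES, subject="Na'im Tyson"))
print(employer_headquarters(TRIPLES, "Na'im Tyson"))
```

The second function follows an edge from a person to an employer and then from the employer to a location, which is the kind of graph traversal that a two-pattern query over a triple store expresses.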

Text Mining Basics
• What do people think text mining is?
  – Automated discovery of new, previously unknown information by automatically extracting information from a usually large amount of different unstructured textual resources (Wasilewska, 2014)

Text Mining Basics
• What is text mining really?
[Figure: Venn diagram with regions labeled Data Mining, Information Retrieval, Statistics, Web Mining, Computational Linguistics & Natural Language Processing, and Text Mining]
Figure 4: Venn Diagram of Text Mining (Wasilewska, 2014).

Text Mining Basics
• A General Approach (ignore the cloud!)
[Figure: Text Mining Process. Stages: Text → Text Preprocessing → Text Transformation (Attribute Generation) → Attribute Selection → Data Mining / Pattern Discovery → Interpretation / Evaluation. Side labels: Document Clustering, Text Characteristics]
Figure 5: General Approaches to Text Mining Process (Wasilewska, 2014).
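The first three stages named in the figure (text preprocessing, attribute generation, attribute selection) can be sketched as follows. Everything here is an assumption made for illustration: the stop-word list, the sample documents, the function names, and the choice of document frequency as the selection criterion are not taken from Wasilewska (2014).

```python
import re
from collections import Counter

# Hypothetical stop-word list; real systems use a much longer one.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def preprocess(text):
    """Text preprocessing: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def generate_attributes(documents):
    """Text transformation: one bag-of-words Counter per document."""
    return [Counter(preprocess(doc)) for doc in documents]


def select_attributes(bags, min_doc_freq=2):
    """Attribute selection: keep terms found in at least min_doc_freq documents."""
    doc_freq = Counter(term for bag in bags for term in bag)
    return sorted(term for term, df in doc_freq.items() if df >= min_doc_freq)


def to_vectors(bags, attributes):
    """Lay each bag out as a count vector over the selected attributes."""
    return [[bag[term] for term in attributes] for bag in bags]


documents = [
    "The semantics of words and sentences",
    "Words and word relations in WordNet",
    "Clustering sentences is a text mining task",
]
bags = generate_attributes(documents)
attributes = select_attributes(bags)
print(attributes)
print(to_vectors(bags, attributes))
```

The resulting vectors are the input that the later data mining stage, such as document clustering, would operate on.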

Text Mining Basics
• Application - Document Clustering
  Goal: Group large amounts of textual data
  Techniques: High Level
    – k-means - top down
      ∗ Cluster documents into k groups using vectors and a distance metric
    – agglomerative hierarchical clustering - bottom up
      ∗ Start with each document being a single cluster
      ∗ Repeatedly merge the two closest clusters
      ∗ Eventually all documents belong to the same cluster
      ∗ Documents represented as a hierarchy (dendrogram)
  Reference: Taming Text (see Ingersoll et al., 2013, chap. 6)
• Final Remarks
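The bottom-up procedure can be sketched in plain Python. This is a minimal illustration under stated assumptions: the slide does not specify a linkage criterion, so single linkage (distance between the closest members) is assumed here, and the document vectors are hypothetical term counts. The returned merge history is the information a dendrogram would draw.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def single_link(cluster_a, cluster_b, vectors):
    """Cluster distance = distance between the closest pair of members."""
    return min(euclidean(vectors[i], vectors[j]) for i in cluster_a for j in cluster_b)


def agglomerative(vectors):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(i,) for i in range(len(vectors))]  # each document starts alone
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest single-link distance.
        a, b = min(
            ((c1, c2) for idx, c1 in enumerate(clusters) for c2 in clusters[idx + 1:]),
            key=lambda pair: single_link(pair[0], pair[1], vectors),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges


# Hypothetical term-count vectors for four documents.
docs = [[2, 0], [3, 0], [0, 4], [0, 6]]
for left, right in agglomerative(docs):
    print(left, "+", right)
```

Each cluster is a tuple of document indices, so the final merge joins the two tuples that together contain every document.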

THANK YOU!!

References

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479, 1992.

Michael Collins. Lexical Semantics: Similarity Measures and Clustering, November 2007. URL http://www.cs.columbia.edu/~mcollins/6864/slides/wordsim.4up.pdf.

Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.

Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris. Taming Text: How to Find, Organize, and Manipulate It. Manning Publications Co., January 2013.

Jaganadhg. WordNet sense similarity with NLTK: some basics, October 2010. URL http://jaganadhg.freeflux.net/blog/archive/tag/WSD/.

George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

Jacob Perkins. Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing, 2010.

Anita Wasilewska. CSE 634 - Data Mining: Text Mining, January 2014. URL http://www.cs.sunysb.edu/ cse634/presentations/TextMining.pdf.
