
Python Code Examples

The document provides code examples for several Python tasks including: 1) Word spotting in a text file and printing matching words. 2) Creating a dictionary of first names from a text file. 3) Computing accuracy results by comparing words in two text files and calculating accuracy metrics. 4) Creating word histograms by counting word frequencies in text. 5) Extracting people names and company names from text using regular expressions.

Uploaded by

Asaf Katz
Copyright
© Attribution Non-Commercial (BY-NC)

Word Spotting
import sys

# raw string so the backslashes in the Windows path are taken literally
fname1 = r"c:\Python Course\ex1.txt"
for line in open(fname1, 'r').readlines():
    for word in line.split():
        if word.endswith('ing'):
            print word
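For readers on Python 3, the same word-spotting idea can be sketched as a small function. This is a minimal sketch rather than the slide's code: the name `spot_words` and the sample text are assumptions made for illustration, and it scans a string instead of the file used above.

```python
def spot_words(text, suffix="ing"):
    """Return the whitespace-separated words in text that end with suffix."""
    return [word
            for line in text.splitlines()
            for word in line.split()
            if word.endswith(suffix)]

# Hypothetical sample text standing in for the contents of ex1.txt.
sample = "She was singing and dancing\nwhile he kept reading quietly"
print(spot_words(sample))
```

Passing the suffix as a parameter keeps the loop reusable for other endings such as 'ed' or 'ly'.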

Creating a Dictionary of First Names


def createNameDict():
    dictNameFile = open('project/dictionaries/names.txt', 'r')
    dictContent = dictNameFile.read()     # read the whole file
    dictWords = dictContent.split(",")    # returns a list with the words
    nameDict = {}                         # initialize a dictionary
    for word in dictWords:
        nameDict[word.strip()] = " "      # enter each word into the dictionary
    return nameDict
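Because the dictionary above is only used for membership tests, a set can serve the same purpose. The following Python 3 sketch assumes the names arrive as one comma-separated string; `create_name_set` and the sample string are hypothetical and stand in for reading names.txt.

```python
def create_name_set(content):
    """Split comma-separated text into a set of stripped, non-empty names."""
    return {name.strip() for name in content.split(",") if name.strip()}

# Hypothetical file contents, including stray spaces and a trailing comma.
names = create_name_set("John, Jane ,Jack,\nJanice,")
print("Jane" in names)
print("Susan" in names)
```

Skipping empty strings avoids storing a blank entry when the file ends with a comma.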

Computing Accuracy Results I


# anfiles.py
# Program to analyze the results of speaker identification.
# Illustrates Python dictionaries
import string, glob, sys

def main():
    # read correct file and test file
    fname1 = sys.argv[1]
    fname2 = sys.argv[2]

    text1 = open(fname1, 'r').read()
    text1 = string.lower(text1)
    words1 = string.split(text1)
    correct_len = len(words1)

    text2 = open(fname2, 'r').read()
    text2 = string.lower(text2)
    words2 = string.split(text2)

Computing Accuracy Results II


    # construct a dictionary of correct results
    correct = {}
    for w in words1:
        correct[w] = 1

    for i in range(correct_len):
        in_count = 0
        portion2 = words2[:i+1]
        for w in portion2:
            if correct.get(w, 0) > 0:
                in_count += 1
        accuracy = float(in_count) / float(len(portion2))
        print "%5d, %5d,%.2f" % (len(portion2), in_count, accuracy)

if __name__ == '__main__':
    main()
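The prefix-by-prefix accuracy computed above can also be sketched in Python 3 with a running hit count. This is an illustrative variant, not the original program: `running_accuracy` and the word lists are assumptions, it walks the test words once instead of recounting each prefix, and it produces one row per test word rather than one per correct word.

```python
def running_accuracy(correct_words, test_words):
    """Yield (prefix_length, hits, accuracy) for each prefix of test_words."""
    correct = {w.lower() for w in correct_words}
    hits = 0
    for n, word in enumerate(test_words, start=1):
        if word.lower() in correct:
            hits += 1
        yield n, hits, hits / n

# Hypothetical word lists standing in for the two input files.
rows = list(running_accuracy("alice bob carol".split(),
                             "bob dave alice".split()))
for n, hits, acc in rows:
    print("%5d, %5d,%.2f" % (n, hits, acc))
```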

Word Histograms
import sre, string

pattern = sre.compile(r'[a-zA-Z]+')

def countwords(text):
    dict = {}
    try:
        iterator = pattern.finditer(text)
        for match in iterator:
            word = match.group()
            try:
                dict[word] = dict[word] + 1
            except KeyError:
                dict[word] = 1
    except sre.error:
        pass  # triggers when first index goes to -1, terminates loop.

Word Histograms
    items = []
    for word in dict.keys():
        items.append((dict[word], word))
    items.sort()
    items.reverse()
    return items

# if run as a script, count words in stdin.
if __name__ == "__main__":
    import sys
    x = countwords(sys.stdin.read())
    s = map(str, x)
    t = string.joinfields(s, "\n")
    print t
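In Python 3 the same histogram can be sketched with `re` and `collections.Counter`. This is a minimal sketch under the assumption that the goal is the same list of (count, word) pairs sorted from most to least frequent; `count_words` and the sample sentence are illustrative names, not part of the slides.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-zA-Z]+")

def count_words(text):
    """Return (count, word) pairs sorted from most to least frequent."""
    counts = Counter(WORD.findall(text))
    return sorted(((n, w) for w, n in counts.items()), reverse=True)

histogram = count_words("the cat and the hat and the bat")
for count, word in histogram:
    print(count, word)
```

Sorting the pairs in reverse mirrors the sort-then-reverse step above, so words with equal counts come out in reverse alphabetical order.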

Extracting People Names and Company Names


import string, sre, glob, sys

def createNameDict():
    dictNameFile = open('names.txt', 'r')
    dictContent = dictNameFile.read()     # read the whole file
    dictWords = dictContent.split(",")    # returns a list with the words
    nameDict = {}                         # initialize a dictionary
    for word in dictWords:
        nameDict[word.strip()] = " "      # enter each word into the dictionary
    return nameDict

def main():
    # read file
    fname1 = sys.argv[1]
    text1 = open(fname1, 'r').read()
    namesDic = createNameDict()
    CompanySuffix = sre.compile(r'corp | ltd | inc | corporation | gmbh | ag | sa ', sre.IGNORECASE)
    pattern = sre.compile(
        r'([A-Z]\w+[ .,-]+)+'
        r'(corp|CORP|Corp|ltd|Ltd|LTD|inc|Inc|INC|corporation|Corporation|CORPORATION|gmbh|GMBH|ag|AG|sa|SA)'
        r'(\.?)')
    pattern1 = sre.compile(r'([A-Z]\w+[\s.-]*){2,4}')

    # Companies
    capitalWords = sre.finditer(pattern, text1)
    for match in capitalWords:
        CapSeq = match.group()
        print CapSeq

    # People
    capitalWords1 = sre.finditer(pattern1, text1)
    for match in capitalWords1:
        wordList = match.group().split()
        # check name in names dictionary
        if namesDic.has_key(wordList[0].strip()):
            print match.group()

if __name__ == '__main__':
    main()
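The two extraction patterns can be tried out in Python 3 with the `re` module. This is a hedged sketch, not the slide's program: the function names, the sample sentence, and the set of first names are assumptions, the suffix list is matched case-insensitively with a scoped `(?i:...)` group (which needs Python 3.6 or later), and a word boundary is added after the suffix.

```python
import re

# Capitalized words followed by a company suffix; only the suffix ignores case.
COMPANY = re.compile(
    r"(?:[A-Z]\w+[ .,-]+)+"
    r"(?i:corporation|corp|ltd|inc|gmbh|ag|sa)\b"
    r"\.?"
)
# Two to four capitalized words in a row, as in pattern1 above.
PERSON = re.compile(r"(?:[A-Z]\w+[\s.-]*){2,4}")

def find_companies(text):
    """Return every substring that looks like a company name."""
    return [m.group() for m in COMPANY.finditer(text)]

def find_people(text, first_names):
    """Return capitalized sequences whose first word is a known first name."""
    people = []
    for m in PERSON.finditer(text):
        candidate = m.group().strip()
        if candidate.split()[0] in first_names:
            people.append(candidate)
    return people

# Hypothetical input text and first-name list.
text = ("We heard that Jane Smith met John Doe after "
        "Acme Widgets Inc. bought Globex Corporation last year.")
companies = find_companies(text)
people = find_people(text, {"Jane", "John"})
print(companies)
print(people)
```

As in the slides, the person pattern also fires on company names; the first-name lookup is what filters those out.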

NLTK
NLTK defines a basic infrastructure that can be used to build NLP programs in Python. It provides:

- Basic classes for representing data relevant to natural language processing.
- Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.
- Standard implementations for each task, which can be combined to solve complex problems.
- Extensive documentation, including tutorials and reference documentation.

RE Show
>>> from nltk.util import re_show
>>> string = """
... It's probably worth paying a premium for funds that invest in markets
... that are partially closed to foreign investors, such as South Korea, ...
... """
>>> re_show('t...', string)
I{t's }probably wor{th p}aying a premium for funds {that} inves{t in} markets
{that} are par{tial}ly closed {to f}oreign inves{tors}, such as Sou{th K}orea, ...
>>>
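The brace-marking behaviour shown above can be approximated without NLTK. The sketch below is a stand-in written with `re.sub`, not NLTK's own implementation; `show_matches` is a hypothetical name, and the sample is a shortened version of the sentence above.

```python
import re

def show_matches(pattern, text, left="{", right="}"):
    """Return text with every match of pattern wrapped in left/right markers."""
    return re.sub(pattern, lambda m: left + m.group() + right, text)

marked = show_matches(r"t...", "It's probably worth paying a premium")
print(marked)
```

Using a function as the replacement avoids having to escape backslashes or group references in the marker strings.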

Classes in Python

Defining Classes
>>> class SimpleClass:
...     def __init__(self, initial_value):
...         self.data = initial_value
...     def set(self, value):
...         self.data = value
...     def get(self):
...         print self.data
...
>>> x = SimpleClass(4)

Inheritance
B is a subclass of A

>>> class B(A):
...     def __init__(self):

SimpleTokenizer implements the interface of TokenizerI

>>> class SimpleTokenizer(TokenizerI):
...     def tokenize(self, str):
...         words = str.split()
...         return [Token(words[i], Location(i))
...                 for i in range(len(words))]

Inheritance Example
from math import floor, sqrt

class point:
    def __init__(self, x=0, y=0):
        self.x, self.y = x, y

class cartesian(point):
    def distanceToOrigin(self):
        return floor(sqrt(self.x**2 + self.y**2))

class manhattan(point):
    def distanceToOrigin(self):
        # absolute values, so negative coordinates are measured correctly
        return abs(self.x) + abs(self.y)
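To see the inheritance at work, both subclasses can be instantiated and asked for their distance through the same method name. The following Python 3 sketch repeats the class definitions so that it runs on its own; the capitalized class names, the snake_case method name, and the sample coordinates are assumptions made for illustration.

```python
from math import floor, sqrt

class Point:
    def __init__(self, x=0, y=0):
        self.x, self.y = x, y

class Cartesian(Point):
    def distance_to_origin(self):
        # straight-line distance, rounded down to a whole number
        return floor(sqrt(self.x ** 2 + self.y ** 2))

class Manhattan(Point):
    def distance_to_origin(self):
        # distance along the axes
        return abs(self.x) + abs(self.y)

# Both subclasses inherit __init__ from Point and differ only in the method.
for p in (Cartesian(3, 4), Manhattan(3, 4)):
    print(type(p).__name__, p.distance_to_origin())
```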

Sets

Sets in Python
The sets module provides classes for constructing and manipulating unordered collections of unique elements. Common uses include:

- membership testing,
- removing duplicates from a sequence, and
- computing standard math operations on sets such as intersection, union, difference, and symmetric difference.

Like other collections, sets support x in set, len(set), and for x in set. Being an unordered collection, sets do not record element position or order of insertion. Accordingly, sets do not support indexing, slicing, or other sequence-like behavior.
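The behaviour described above can be tried with the built-in `set` type, which Python 3 offers in place of the sets module. This is a small sketch with made-up data, showing duplicate removal, membership testing, and the failure of indexing.

```python
words = ["spam", "eggs", "spam", "bacon", "eggs"]

unique = set(words)              # duplicates collapse into one element each
print(len(unique))               # number of distinct words
print("spam" in unique)          # membership test
print(sorted(unique))            # sort to get a repeatable order for display

indexing_error = None
try:
    unique[0]                    # sets have no positions, so this fails
except TypeError as exc:
    indexing_error = exc
print("indexing raised:", type(indexing_error).__name__)
```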

Some Details about Implementation


Most set applications use the Set class which provides every set method except for __hash__(). For advanced applications requiring a hash method, the ImmutableSet class adds a __hash__() method but omits methods which alter the contents of the set.

The set classes are implemented using dictionaries. As a result, sets cannot contain mutable elements such as lists or dictionaries. However, they can contain immutable collections such as tuples or instances of ImmutableSet.

For convenience in implementing sets of sets, inner sets are automatically converted to immutable form, for example, Set([Set(['dog'])]) is transformed to Set([ImmutableSet(['dog'])]).
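In Python 3 the built-in `frozenset` plays the role that ImmutableSet plays here. The sketch below uses made-up data to show a set of sets and the error raised for a mutable element; note that, unlike the sets module described above, the built-in `set` does not convert inner sets automatically, so the inner sets are created as `frozenset` explicitly.

```python
inner = frozenset(["dog"])
outer = {inner, frozenset(["cat", "bird"])}   # a set whose elements are sets
print(inner in outer)

mutable_error = None
try:
    bad = {["dog"]}              # a list is mutable and therefore unhashable
except TypeError as exc:
    mutable_error = exc
print("list element raised:", type(mutable_error).__name__)

# Tuples are immutable, so they are accepted as elements.
pairs = {("dog", 1), ("cat", 2)}
print(("dog", 1) in pairs)
```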

Set Operations
Operation                     Equivalent   Result
len(s)                                     cardinality of set s
x in s                                     test x for membership in s
x not in s                                 test x for non-membership in s
s.issubset(t)                 s <= t       test whether every element in s is in t
s.issuperset(t)               s >= t       test whether every element in t is in s
s.union(t)                    s | t        new set with elements from both s and t
s.intersection(t)             s & t        new set with elements common to s and t
s.difference(t)               s - t        new set with elements in s but not in t
s.symmetric_difference(t)     s ^ t        new set with elements in either s or t but not both
s.copy()                                   new set with a shallow copy of s
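The method and operator forms listed above can be compared side by side. This sketch uses the built-in `set` of Python 3, which supports the same non-mutating operations; the two sample sets are arbitrary.

```python
s = {1, 2, 3, 4}
t = {3, 4, 5}

union = s | t                    # same as s.union(t)
common = s & t                   # same as s.intersection(t)
only_s = s - t                   # same as s.difference(t)
either = s ^ t                   # same as s.symmetric_difference(t)

print(sorted(union), sorted(common), sorted(only_s), sorted(either))
print({3, 4} <= s, s >= {3, 4})  # subset and superset tests

duplicate = s.copy()             # a new set object with the same elements
print(duplicate == s, duplicate is s)
```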

Operations not for ImmutableSet


Operation                            Equivalent   Result
s.union_update(t)                    s |= t       return set s with elements added from t
s.intersection_update(t)             s &= t       return set s keeping only elements also found in t
s.difference_update(t)               s -= t       return set s after removing elements found in t
s.symmetric_difference_update(t)     s ^= t       return set s with elements from s or t but not both
s.add(x)                                          add element x to set s
s.remove(x)                                       remove x from set s; raises KeyError if not present
s.discard(x)                                      removes x from set s if present
s.pop()                                           remove and return an arbitrary element from s; raises KeyError if empty
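The in-place operations can be traced step by step. This sketch uses the built-in `set` of Python 3, where the augmented operators behave as in the table but the method corresponding to union_update is named `update`; the values are arbitrary.

```python
s = {1, 2, 3}
s |= {3, 4}                      # s is now {1, 2, 3, 4}
s &= {2, 3, 4, 5}                # s is now {2, 3, 4}
s -= {4}                         # s is now {2, 3}
s ^= {3, 9}                      # s is now {2, 9}
s.add(7)                         # s is now {2, 7, 9}
snapshot = set(s)

s.discard(100)                   # absent element: silently ignored
remove_error = None
try:
    s.remove(100)                # absent element: raises KeyError
except KeyError as exc:
    remove_error = exc

popped = s.pop()                 # removes and returns an arbitrary element
print(sorted(snapshot), popped in snapshot, len(s))
```

The contrast between `discard` and `remove` is the main practical point: choose `remove` when a missing element signals a bug, and `discard` when it does not matter.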

Set Examples
>>> from sets import Set
>>> engineers = Set(['John', 'Jane', 'Jack', 'Janice'])
>>> programmers = Set(['Jack', 'Sam', 'Susan', 'Janice'])
>>> managers = Set(['Jane', 'Jack', 'Susan', 'Zack'])
>>> employees = engineers | programmers | managers           # union
>>> engineering_management = engineers & managers            # intersection
>>> fulltime_management = managers - engineers - programmers # difference
>>> engineers.add('Marvin')                                  # add element
>>> print engineers
Set(['Jane', 'Marvin', 'Janice', 'John', 'Jack'])
>>> employees.issuperset(engineers)     # superset test
False

Set Examples
>>> employees.union_update(engineers)   # update from another set
>>> employees.issuperset(engineers)
True
>>> for group in [engineers, programmers, managers, employees]:
...     group.discard('Susan')          # unconditionally remove element
...     print group
...
Set(['Jane', 'Marvin', 'Janice', 'John', 'Jack'])
Set(['Janice', 'Jack', 'Sam'])
Set(['Jane', 'Zack', 'Jack'])
Set(['Jack', 'Sam', 'Jane', 'Marvin', 'Janice', 'John', 'Zack'])

Google API
Get it from http://sourceforge.net/projects/pygoogle/

pygoogle is a Python wrapper for the Google web API. It allows you to do Google searches, retrieve pages from the Google cache, and ask Google for spelling suggestions.

Utilizing the Google API - I


import sys
import string
import codecs
import google

print '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">'
print '<head>'
print ' <title>Google with Python</title>'
print '</head>'
print '<body>'
print '<h1>Google with Python</h1>'

google.LICENSE_KEY = '[YOUR GOOGLE LICENSE KEY]'
sys.stdout = codecs.lookup('utf-8')[-1](sys.stdout)   # write output as UTF-8
query = "Your Query"
data = google.doGoogleSearch(query)

Utilizing the Google API - II


print '<p><strong>1-10 of "' + query + '" total results for '
print str(data.meta.estimatedTotalResultsCount) + '</strong></p>'

for result in data.results:
    # convert the markup returned by the API to XHTML-friendly tags
    title = result.title
    title = title.replace('<b>', '<strong>')
    title = title.replace('</b>', '</strong>')
    snippet = result.snippet
    snippet = snippet.replace('<b>', '<strong>')
    snippet = snippet.replace('</b>', '</strong>')
    snippet = snippet.replace('<br>', '<br />')
    print '<h2><a href="' + result.URL + '">' + title + '</a></h2>'
    print '<p>' + snippet + '</p>'

print '</body>'
print '</html>'

Yahoo API
http://pysearch.sourceforge.net/
http://python.codezoo.com/pub/component/4193?category=198

This project implements a Python API for the Yahoo Search Webservices API. pYsearch is an OO abstraction of the web services, with emphasis on ease of use and extensibility.

URLLIB
This module provides a high-level interface for fetching data across the World Wide Web. In particular, the urlopen() function is similar to the built-in function open(), but accepts Universal Resource Locators (URLs) instead of filenames. Some restrictions apply: it can only open URLs for reading, and no seek operations are available.

Urllib Syntax
# Use http://www.someproxy.com:3128 for http proxying
proxies = {'http': 'http://www.someproxy.com:3128'}
filehandle = urllib.urlopen(some_url, proxies=proxies)

# Don't use any proxies
filehandle = urllib.urlopen(some_url, proxies={})
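In Python 3, `urlopen()` no longer takes a `proxies` argument; one way to get the same effect is to build an opener around `urllib.request.ProxyHandler`. The sketch below only constructs the openers and makes no network request; the proxy address is the placeholder from the snippet above, and `some_url` remains a placeholder for whatever URL is to be fetched.

```python
import urllib.request

# Route http requests through the placeholder proxy from the snippet above.
proxies = {"http": "http://www.someproxy.com:3128"}
proxy_handler = urllib.request.ProxyHandler(proxies)
proxy_opener = urllib.request.build_opener(proxy_handler)

# An empty mapping means: do not use any proxies.
direct_handler = urllib.request.ProxyHandler({})
direct_opener = urllib.request.build_opener(direct_handler)

# Fetching would then look like this (not executed here; some_url is a placeholder):
# filehandle = proxy_opener.open(some_url)
print(type(proxy_opener).__name__, type(direct_opener).__name__)
```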

URLLIB Examples
Here is an example session that uses the "GET" method to retrieve a URL containing parameters:

>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
>>> print f.read()

The following example uses the "POST" method instead:

>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
>>> print f.read()
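In Python 3 the encoding helper lives in `urllib.parse`. The sketch below builds the GET URL and the POST body for the same parameters without contacting the server, which may not be reachable; the variable names are illustrative.

```python
from urllib.parse import urlencode, parse_qs

params = urlencode({"spam": 1, "eggs": 2, "bacon": 0})

# GET: the encoded parameters become the query string of the URL.
get_url = "http://www.musi-cal.com/cgi-bin/query?%s" % params

# POST: the same encoded parameters are sent as the request body, as bytes.
post_body = params.encode("ascii")

print(get_url)
print(post_body)
print(parse_qs(params))          # decoding recovers the fields as lists of strings
```

The only difference between the two requests is where the encoded string travels: in the URL for GET, in the body for POST.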

What is a Proxy
A proxy server is a computer that offers a computer network service to allow clients to make indirect network connections to other network services. A client connects to the proxy server, then requests a connection, file, or other resource available on a different server. The proxy provides the resource either by connecting to the specified server or by serving it from a cache. In some cases, the proxy may alter the client's request or the server's response for various purposes. A proxy server can also serve as a firewall.
