Natural Language Processing Recipes: Unlocking Text Data with Machine Learning and Deep Learning Using Python, 2nd Edition, by Akshay Kulkarni and Adarsha Shivananda

The book includes recipes for data extraction, processing, and analysis using Python. The content emphasizes practical applications of machine learning and deep learning in natural language processing.

Natural Language Processing Recipes
Unlocking Text Data with Machine Learning and Deep Learning Using Python

Second Edition

Akshay Kulkarni
Adarsha Shivananda
Natural Language Processing Recipes: Unlocking Text Data with Machine Learning and Deep Learning Using Python
Akshay Kulkarni, Bangalore, Karnataka, India
Adarsha Shivananda, Bangalore, Karnataka, India

ISBN-13 (pbk): 978-1-4842-7350-0     ISBN-13 (electronic): 978-1-4842-7351-7


https://doi.org/10.1007/978-1-4842-7351-7

Copyright © 2021 by Akshay Kulkarni and Adarsha Shivananda


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with
every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an
editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the
trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not
identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to
proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.
Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Celestin Suresh John
Development Editor: Laura Berendson
Coordinating Editor: Shrikant Vishwakarma
Cover designed by eStudioCalamar
Cover image designed by Pexels
Distributed to the book trade worldwide by Springer Science+Business Media LLC, 1 New York Plaza, Suite
4600, New York, NY 10004. Phone 1-800-SPRINGER, fax (201) 348-4505, email orders-ny@springer-sbm.com,
or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner)
is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware
corporation.
For information on translations, please e-mail booktranslations@springernature.com; for reprint, paperback,
or audio rights, please e-mail bookpermissions@springernature.com, or visit
http://www.apress.com/rights-permissions.
Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and
licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales
web page at http://www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the author in this book is available to
readers on GitHub via the book’s product page, located at www.apress.com/978-1-4842-7350-0. For more
detailed information, please visit http://www.apress.com/source-code.
Printed on acid-free paper
To our families
Table of Contents
About the Authors
About the Technical Reviewer
Acknowledgments
Introduction

Chapter 1: Extracting the Data
  Introduction
  Client Data
  Free Sources
  Web Scraping
  Recipe 1-1. Collecting Data
    Problem
    Solution
    How It Works
  Recipe 1-2. Collecting Data from PDFs
    Problem
    Solution
    How It Works
  Recipe 1-3. Collecting Data from Word Files
    Problem
    Solution
    How It Works
  Recipe 1-4. Collecting Data from JSON
    Problem
    Solution
    How It Works
  Recipe 1-5. Collecting Data from HTML
    Problem
    Solution
    How It Works
  Recipe 1-6. Parsing Text Using Regular Expressions
    Problem
    Solution
    How It Works
  Recipe 1-7. Handling Strings
    Problem
    Solution
    How It Works
  Recipe 1-8. Scraping Text from the Web
    Problem
    Solution
    How It Works

Chapter 2: Exploring and Processing Text Data
  Recipe 2-1. Converting Text Data to Lowercase
    Problem
    Solution
    How It Works
  Recipe 2-2. Removing Punctuation
    Problem
    Solution
    How It Works
  Recipe 2-3. Removing Stop Words
    Problem
    Solution
    How It Works
  Recipe 2-4. Standardizing Text
    Problem
    Solution
    How It Works
  Recipe 2-5. Correcting Spelling
    Problem
    Solution
    How It Works
  Recipe 2-6. Tokenizing Text
    Problem
    Solution
    How It Works
  Recipe 2-7. Stemming
    Problem
    Solution
    How It Works
  Recipe 2-8. Lemmatizing
    Problem
    Solution
    How It Works
  Recipe 2-9. Exploring Text Data
    Problem
    Solution
    How It Works
  Recipe 2-10. Dealing with Emojis and Emoticons
    Problem
    Solution
    How It Works
    Problem
    Solution
    How It Works
    Problem
    Solution
    How It Works
    Problem
    Solution
    How It Works
    Problem
    Solution
    How It Works
  Recipe 2-11. Building a Text Preprocessing Pipeline
    Problem
    Solution
    How It Works

Chapter 3: Converting Text to Features
  Recipe 3-1. Converting Text to Features Using One-Hot Encoding
    Problem
    Solution
    How It Works
  Recipe 3-2. Converting Text to Features Using a Count Vectorizer
    Problem
    Solution
    How It Works
  Recipe 3-3. Generating n-grams
    Problem
    Solution
    How It Works
  Recipe 3-4. Generating a Co-occurrence Matrix
    Problem
    Solution
    How It Works
  Recipe 3-5. Hash Vectorizing
    Problem
    Solution
    How It Works
  Recipe 3-6. Converting Text to Features Using TF-IDF
    Problem
    Solution
    How It Works
  Recipe 3-7. Implementing Word Embeddings
    Problem
    Solution
    How It Works
  Recipe 3-8. Implementing fastText
    Problem
    Solution
    How It Works
  Recipe 3-9. Converting Text to Features Using State-of-the-Art Embeddings
    Problem
    Solution
    ELMo
    Sentence Encoders
    Open-AI GPT
    How It Works

Chapter 4: Advanced Natural Language Processing
  Recipe 4-1. Extracting Noun Phrases
    Problem
    Solution
    How It Works
  Recipe 4-2. Finding Similarity Between Texts
    Solution
    How It Works
  Recipe 4-3. Tagging Part of Speech
    Problem
    Solution
    How It Works
  Recipe 4-4. Extracting Entities from Text
    Problem
    Solution
    How It Works
  Recipe 4-5. Extracting Topics from Text
    Problem
    Solution
    How It Works
  Recipe 4-6. Classifying Text
    Problem
    Solution
    How It Works
  Recipe 4-7. Carrying Out Sentiment Analysis
    Problem
    Solution
    How It Works
  Recipe 4-8. Disambiguating Text
    Problem
    Solution
    How It Works
  Recipe 4-9. Converting Speech to Text
    Problem
    Solution
    How It Works
  Recipe 4-10. Converting Text to Speech
    Problem
    Solution
    How It Works
  Recipe 4-11. Translating Speech
    Problem
    Solution
    How It Works

Chapter 5: Implementing Industry Applications��������������������������������������������������� 135


Recipe 5-1. Implementing Multiclass Classification����������������������������������������������������������������� 135
Problem������������������������������������������������������������������������������������������������������������������������������� 136
Solution������������������������������������������������������������������������������������������������������������������������������� 136
How It Works����������������������������������������������������������������������������������������������������������������������� 136
Recipe 5-2. Implementing Sentiment Analysis������������������������������������������������������������������������� 143
Problem������������������������������������������������������������������������������������������������������������������������������� 143
Solution������������������������������������������������������������������������������������������������������������������������������� 143
How It Works����������������������������������������������������������������������������������������������������������������������� 143
Recipe 5-3. Applying Text Similarity Functions������������������������������������������������������������������������� 154
Problem������������������������������������������������������������������������������������������������������������������������������� 154
Solution������������������������������������������������������������������������������������������������������������������������������� 155
How It Works����������������������������������������������������������������������������������������������������������������������� 155
Recipe 5-4. Summarizing Text Data������������������������������������������������������������������������������������������ 165
Problem������������������������������������������������������������������������������������������������������������������������������� 166
Solution������������������������������������������������������������������������������������������������������������������������������� 166
How It Works����������������������������������������������������������������������������������������������������������������������� 166
Recipe 5-5. Clustering Documents������������������������������������������������������������������������������������������� 172
Problem������������������������������������������������������������������������������������������������������������������������������� 172
Solution������������������������������������������������������������������������������������������������������������������������������� 172
How It Works����������������������������������������������������������������������������������������������������������������������� 172
Recipe 5-6. NLP in a Search Engine����������������������������������������������������������������������������������������� 178
Problem������������������������������������������������������������������������������������������������������������������������������� 178
Solution������������������������������������������������������������������������������������������������������������������������������� 178
How It Works����������������������������������������������������������������������������������������������������������������������� 179

xi
Table of Contents

Recipe 5-7. Detecting Fake News��������������������������������������������������������������������������������������������� 181


Problem������������������������������������������������������������������������������������������������������������������������������� 181
Solution������������������������������������������������������������������������������������������������������������������������������� 182
How It Works����������������������������������������������������������������������������������������������������������������������� 182
Recipe 5-8. Movie Genre Tagging��������������������������������������������������������������������������������������������� 195
Problem������������������������������������������������������������������������������������������������������������������������������� 195
Solution������������������������������������������������������������������������������������������������������������������������������� 196
How It Works����������������������������������������������������������������������������������������������������������������������� 197

Chapter 6: Deep Learning for NLP������������������������������������������������������������������������ 213


Introduction to Deep Learning�������������������������������������������������������������������������������������������������� 213
Convolutional Neural Networks������������������������������������������������������������������������������������������������� 215
Data������������������������������������������������������������������������������������������������������������������������������������������� 215
Architecture������������������������������������������������������������������������������������������������������������������������������ 216
Convolution������������������������������������������������������������������������������������������������������������������������������� 216
Nonlinearity (ReLU)������������������������������������������������������������������������������������������������������������������� 216
Pooling�������������������������������������������������������������������������������������������������������������������������������������� 217
Flatten, Fully Connected, and Softmax Layers�������������������������������������������������������������������������� 217
Backpropagation: Training the Neural Network������������������������������������������������������������������������ 218
Recurrent Neural Networks������������������������������������������������������������������������������������������������������ 218
Training RNN: Backpropagation Through Time (BPTT)�������������������������������������������������������������� 219
Long Short-Term Memory (LSTM)��������������������������������������������������������������������������������������������� 219
Recipe 6-1. Retrieving Information������������������������������������������������������������������������������������������� 220
Problem������������������������������������������������������������������������������������������������������������������������������� 221
Solution������������������������������������������������������������������������������������������������������������������������������� 221
How It Works����������������������������������������������������������������������������������������������������������������������� 222
Recipe 6-2. Classifying Text with Deep Learning���������������������������������������������������������������������� 227
Problem������������������������������������������������������������������������������������������������������������������������������� 227
Solution������������������������������������������������������������������������������������������������������������������������������� 227
How It Works����������������������������������������������������������������������������������������������������������������������� 228

xii
Table of Contents

Recipe 6-3. Next Word Prediction��������������������������������������������������������������������������������������������� 240


Problem������������������������������������������������������������������������������������������������������������������������������� 241
Solution������������������������������������������������������������������������������������������������������������������������������� 241
How It Works����������������������������������������������������������������������������������������������������������������������� 241
Recipe 6-4. Stack Overflow question recommendation������������������������������������������������������������ 248
Problem������������������������������������������������������������������������������������������������������������������������������� 249
Solution������������������������������������������������������������������������������������������������������������������������������� 249
How It Works����������������������������������������������������������������������������������������������������������������������� 249

Chapter 7: Conclusion and Next-Gen NLP������������������������������������������������������������� 263


Recipe 7-1. Recent advancements in text to features or distributed representations������������� 265
Problem������������������������������������������������������������������������������������������������������������������������������� 265
Solution������������������������������������������������������������������������������������������������������������������������������� 265
Recipe 7-2. Advanced deep learning for NLP��������������������������������������������������������������������������� 265
Problem������������������������������������������������������������������������������������������������������������������������������� 265
Solution������������������������������������������������������������������������������������������������������������������������������� 265
Recipe 7-3. Reinforcement learning applications in NLP��������������������������������������������������������� 266
Problem������������������������������������������������������������������������������������������������������������������������������� 266
Solution������������������������������������������������������������������������������������������������������������������������������� 266
Recipe 7-4. Transfer learning and pre-trained models������������������������������������������������������������� 267
Problem������������������������������������������������������������������������������������������������������������������������������� 267
Solution������������������������������������������������������������������������������������������������������������������������������� 268
Recipe 7-5. Meta-learning in NLP��������������������������������������������������������������������������������������������� 273
Problem������������������������������������������������������������������������������������������������������������������������������� 273
Solution������������������������������������������������������������������������������������������������������������������������������� 273
Recipe 7-6. Capsule networks for NLP������������������������������������������������������������������������������������� 274
Problem������������������������������������������������������������������������������������������������������������������������������� 274
Solution������������������������������������������������������������������������������������������������������������������������������� 274

Index��������������������������������������������������������������������������������������������������������������������� 277

xiii
About the Authors
Akshay Kulkarni is a renowned AI and machine learning
evangelist and thought leader. He has consulted several
Fortune 500 and global enterprises on driving AI and
data science–led strategic transformation. Akshay has
rich experience in building and scaling AI and machine
learning businesses and creating significant impact. He
is currently a data science and AI manager at Publicis
Sapient, where he is part of strategy and transformation
interventions through AI. He manages high-priority
growth initiatives around data science and works on
various artificial intelligence engagements by applying
state-of-the-art techniques to this space.
Akshay is also a Google Developers Expert in machine learning, a published author
of books on NLP and deep learning, and a regular speaker at major AI and data science
conferences.
In 2019, Akshay was named one of the top “40 under 40 data scientists” in India.
In his spare time, he enjoys reading, writing, coding, and mentoring aspiring data
scientists. He lives in Bangalore, India, with his family.

Adarsha Shivananda is a lead data scientist on Indegene Inc.’s product and
technology team, where he leads a group of analysts who bring predictive
analytics and AI features to healthcare software products. These are mainly
multichannel activities for pharma products and solving
the real-time problems encountered by pharma sales reps.
Adarsha aims to build a pool of exceptional data scientists
within the organization to solve greater health care problems
through brilliant training programs. He always wants to stay
ahead of the curve.


His core expertise involves machine learning, deep learning, recommendation
systems, and statistics. Adarsha has worked on various data science projects across
multiple domains using different technologies and methodologies. Previously, he
worked for Tredence Analytics and IQVIA.
He lives in Bangalore, India, and loves to read, ride, and teach data science.

About the Technical Reviewer
Aakash Kag is a data scientist at AlixPartners and is a
co-founder of the Emeelan application. He has six years
of experience in big data analytics and has a postgraduate
degree in computer science with a specialization in big data
analytics. Aakash is passionate about developing social
platforms, machine learning, and meetups, where he often
speaks.

Acknowledgments
We are grateful to our families for their motivation and constant support.
We want to express our gratitude to our mentors and friends for their input,
inspiration, and support. A special thanks to Anoosh R. Kulkarni, a data scientist at
Quantziq, for his support in writing this book and his technical input. A big thanks to the
Apress team for their constant support and help.
Finally, we would like to thank you, the reader, for showing an interest in this book
and making your natural language processing journey more exciting.
Note that the views and opinions expressed in this book are those of the authors.

Introduction
According to industry estimates, more than 80% of the data being generated is in an
unstructured format in the form of text, images, audio, or video. Data is being generated
as we speak, write, tweet, use social media platforms, send messages on messaging
platforms, use ecommerce to shop, and do various other activities. The majority of this
data exists in textual form.

So, what is unstructured data? Unstructured data is information that doesn't reside
in a traditional relational database. Examples include documents, blogs, social media
feeds, pictures, and videos.
Most insights are locked within different types of unstructured data. Unlocking
unstructured data plays a vital role for every organization that wants to make
better decisions. This book unlocks the potential of textual data.
Textual data is the most common and comprises more than 50% of unstructured
data. Examples include tweets/posts on social media, chat conversations, news, blogs,
articles, product or services reviews, and patient records in the healthcare sector. Recent
examples include voice-driven bots like Siri and Alexa.


To retrieve significant and actionable insights from textual data and unlock its
potential, we use natural language processing coupled with machine learning and deep
learning.
But what is natural language processing? Machines and algorithms do not
understand text or characters, so it is very important to convert textual data into
a machine-understandable format (like numbers or binary) to analyze it. Natural
language processing (NLP) allows machines to understand and interpret the human
language.
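
As a minimal sketch of what converting text into a machine-understandable format can look like, the snippet below counts word occurrences and maps each sentence onto a fixed vocabulary. The helper names (tokenize, to_vector) and the sample sentences are illustrative assumptions, not code from the book's recipes.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def to_vector(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Hypothetical sample sentences, used only for illustration
docs = ["I like NLP", "I like Python and I like text"]

# Build a sorted vocabulary from every token seen in the documents
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

vectors = [to_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Each sentence becomes a list of numbers of the same length, which is the kind of input a machine learning algorithm can work with.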
If you want to use the power of unstructured text, this book is the right starting point.
This book unearths the concepts and implementation of natural language processing
and its applications in the real world. NLP offers unbounded opportunities for solving
interesting problems in artificial intelligence, making it the latest frontier for developing
intelligent, deep learning–based applications.

What Does This Book Cover?


Natural Language Processing Recipes is a handy problem/solution reference for learning
and implementing NLP solutions using Python. The book is packed with lots of code
and approaches that help you quickly learn and implement both basic and advanced
NLP techniques. You will learn how to efficiently use a wide range of NLP packages,
implement text classification, and identify parts of speech. You also learn about topic
modeling, text summarization, text generation, sentiment analysis, and many other NLP
applications.
This new edition of Natural Language Processing Recipes focuses on implementing
end-to-end projects using Python and leveraging cutting-edge algorithms and transfer
learning.
The book begins by discussing text data collections, web scraping, and different
types of data sources. You learn how to clean and preprocess text data and analyze
it using advanced algorithms. Throughout the book, you explore the semantic as
well as syntactic analysis of text. It covers complex NLP solutions that involve text
normalization, various advanced preprocessing methods, part-of-speech (POS)
tagging, parsing, text summarization, sentiment analysis, topic modeling, named-entity
recognition (NER), word2vec, seq2seq, and more.


The book covers both fundamental and state-of-the-art techniques used in machine
learning applications and deep learning natural language processing. This edition
includes various advanced techniques to convert text to features, like GloVe, ELMo,
and BERT. It also explains how transformers work, using Sentence-BERT and GPT as
examples.
The book closes by discussing advanced industrial applications of NLP, each with
a solution approach and implementation. It also applies deep learning to natural
language processing and natural language generation problems, using advanced RNNs
such as long short-term memory (LSTM) to solve complex text generation tasks, and
it explores embeddings, which are high-quality representations of words in a
language.
This second edition explains a few advanced state-of-the-art embeddings and
industrial applications, along with end-to-end implementations using deep learning.
Each chapter includes several code examples and illustrations.
By the end of the book, you will have a clear understanding of implementing natural
language processing. You will have worked on multiple examples that implement NLP
techniques in the real world. You will be comfortable with various NLP techniques
coupled with machine learning and deep learning, and with their industrial
applications, which makes the NLP journey much more interesting and improves your
Python coding skills.

Who This Book Is For


This book explains various concepts and implementations to get more clarity when
applying NLP algorithms to chosen data. You learn about all the ingredients you need
to become successful in the NLP space. Fundamental Python skills are assumed, as well
as some knowledge of machine learning and basic NLP. If you are an NLP or machine
learning enthusiast and an intermediate Python programmer who wants to quickly
master natural language processing, this learning path will do you a lot of good.
All you need to know are the basics of machine learning and Python to enjoy the book.


What You Will Learn


• The core concepts of implementing NLP, its various approaches, and
using Python libraries such as NLTK, TextBlob, spaCy, and Stanford
CoreNLP

• Text preprocessing and feature engineering in NLP along with
advanced methods of feature engineering

• Information retrieval, text summarization, sentiment analysis, text
classification, and other advanced NLP techniques solved by leveraging
machine learning and deep learning

• The problems faced by industries and how to solve them using
NLP techniques

• Implementing an end-to-end pipeline of NLP life cycle projects,
which includes framing the problem, finding, collecting, and
preprocessing the data, and solving the problem using cutting-edge
techniques and tools

What Do You Need for This Book?


To perform all the recipes in this book successfully, you need Python 3.x
running on any Windows- or Unix-based operating system with a processor of 2.0 GHz
or higher and a minimum of 4 GB RAM. You can download Python from Anaconda and
leverage a Jupyter notebook for coding purposes. This book assumes you know Keras
basics and how to install the basic machine learning and deep learning libraries.
Please make sure you upgrade or install the latest version of all the libraries.
Python is the most popular and widely used tool for building NLP applications. It
has many sophisticated libraries to perform NLP tasks, from basic preprocessing to
advanced techniques.
To install any library from a Python Jupyter notebook, prefix the pip install command with !.
NLTK is a natural language toolkit and is commonly called “the mother of all NLP
libraries.” It is one of the primary resources when it comes to Python and NLP.

!pip install nltk

#import the library before calling the downloader
import nltk
nltk.download()


spaCy is a trending library that comes with the added flavors of a deep learning
framework. Although spaCy doesn’t cover all NLP functionalities, it does many things well.

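!pip install spacy

#if above doesn't work, try this in your terminal/command prompt
conda install -c conda-forge spacy
#download the small English model (spaCy 3.x package name)
python -m spacy download en_core_web_sm
#then load the model in Python
import spacy
nlp = spacy.load('en_core_web_sm')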

TextBlob is one of data scientists’ favorite libraries when it comes to implementing
NLP tasks. It is based on both NLTK and Pattern. TextBlob isn’t the fastest or most
complete library, however.

!pip install textblob

CoreNLP is a Python wrapper for Stanford CoreNLP. The toolkit provides robust,
accurate, and optimized techniques for tagging, parsing, and analyzing text in various
languages.

!pip install CoreNLP

There are hundreds of other NLP libraries, but these are the widely used and
important ones.
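
Because the recipes assume recent library versions, it can help to check what is already installed before starting. The following is a minimal sketch that uses only the Python standard library (Python 3.8 or later); the helper name installed_version and the list of distribution names are assumptions made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Distribution names as published on PyPI (assumed for this sketch)
for name in ["nltk", "spacy", "textblob"]:
    print(name, installed_version(name) or "not installed")
```

Any library reported as not installed can then be added with the pip commands shown earlier.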
There is an immense number of NLP industrial applications that are leveraged to
uncover insights. By the end of the book, you will have implemented many of these use
cases, from framing a business problem to building applications and drawing business
insights. The following are some examples.

• Sentiment analysis: a customer’s emotions toward products offered
by the business

• Topic modeling to extract the distinct topics from a group of
documents

• Complaint classifications/email classifications/ecommerce product
classification, and so on

• Document categorization/management using different clustering
techniques


• Résumé shortlisting and job description matching using similarity
methods

• Advanced feature engineering techniques (word2vec and fastText) to
capture context

• Information/document retrieval systems, for example, search
engines

• Chatbots, Q&A, and voice-to-text applications like Siri and Alexa

• Language detection and translation using neural networks

• Text summarization using graph methods and advanced techniques

• Text generation/predicting the next sequence of words using deep
learning algorithms

CHAPTER 1

Extracting the Data


This chapter covers various sources of text data and the ways to extract it. Textual data
can act as information or insights for businesses. The following recipes are covered.

• Recipe 1. Text data collection using APIs

• Recipe 2. Reading a PDF file in Python

• Recipe 3. Reading a Word document

• Recipe 4. Reading a JSON object

• Recipe 5. Reading an HTML page and HTML parsing

• Recipe 6. Regular expressions

• Recipe 7. String handling

• Recipe 8. Web scraping

Introduction
Before getting into the details of the book, let’s look at generally available data sources.
We need to identify potential data sources that can help with solving data science use
cases.

Client Data
For any problem statement, one of the sources is the data that is already present. The
business decides where it wants to store its data. Data storage depends on the type of
business, the amount of data, and the costs associated with the sources. The following
are some examples.

© Akshay Kulkarni and Adarsha Shivananda 2021
A. Kulkarni and A. Shivananda, Natural Language Processing Recipes,
https://doi.org/10.1007/978-1-4842-7351-7_1

• SQL databases

• HDFS

• Cloud storage

• Flat files

Free Sources
A large amount of data is freely available on the Internet. You just need to streamline the
problem and start exploring multiple free data sources.

• Free APIs like Twitter

• Wikipedia

• Government data (e.g., http://data.gov)

• Census data (e.g., www.census.gov/data.html)

• Health care claim data (e.g., www.healthdata.gov)

• Data science community websites (e.g., www.kaggle.com)

• Google dataset search (e.g., https://datasetsearch.research.google.com)

Web Scraping
Web scraping extracts content/data from websites, blogs, forums, and retail
websites (for example, product reviews), with permission from the respective
sources, using web scraping packages in Python.
There are a lot of other sources, such as news data and economic data, that can be
leveraged for analysis.

Recipe 1-1. Collecting Data


There are a lot of free APIs through which you can collect data and use it to solve
problems. Let’s discuss the Twitter API.


Problem
You want to collect text data using Twitter APIs.

Solution
Twitter holds an enormous amount of valuable data, and social media marketers
make their living from it. A huge number of tweets are posted every day, and every
tweet has some story to tell. When this data is collected and analyzed, it gives a
business tremendous insight into its company, products, services, and so forth.
Let’s now look at how to pull data and then explore how to leverage it in the coming
chapters.

How It Works
Step 1-1. Log in to the Twitter developer portal
Log in to the Twitter developer portal at https://developer.twitter.com.
Create your own app in the Twitter developer portal, and get the following keys.
Once you have these credentials, you can start pulling data.

• consumer key: The key associated with the application (Twitter,
Facebook, etc.)

• consumer secret: The password used to authenticate with the
authentication server (Twitter, Facebook, etc.)

• access token: The key given to the client after successful
authentication of keys

• access token secret: The password for the access key
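
Since these four values are secrets, one option is to keep them out of the notebook and read them from environment variables. The following is a minimal sketch of that idea using only the standard library; the variable names (TWITTER_CONSUMER_KEY and so on) and the helper load_credentials are hypothetical choices, not names required by Twitter or Tweepy.

```python
import os

# Hypothetical environment variable names; choose any names you like
REQUIRED = [
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
]

def load_credentials(env=None):
    """Read the four credentials, failing early if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}

# Example with a stand-in mapping instead of the real environment
fake_env = {name: "dummy-value" for name in REQUIRED}
creds = load_credentials(fake_env)
print(sorted(creds))
```

Calling load_credentials() with no argument reads from the real environment; the returned values can then be passed to the authentication calls in the next step.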

Step 1-2. Execute query in Python


Once all the credentials are in place, use the following code to fetch the data.

# Install tweepy
!pip install tweepy

# Import the libraries


import numpy as np
import tweepy
import json
import pandas as pd
from tweepy import OAuthHandler

# credentials

consumer_key = "adjbiejfaaoeh"
consumer_secret = "had73haf78af"
access_token = "jnsfby5u4yuawhafjeh"
access_token_secret = "jhdfgay768476r"

# calling API

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)


auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Provide the query you want to pull the data for. For example, pulling data
# for the mobile phone ABC

query = "ABC"

# Fetching tweets
# Note: in Tweepy 4.x this method is named api.search_tweets

Tweets = api.search(query, count=10, lang='en', exclude='retweets',
                    tweet_mode='extended')

This query pulls the top ten tweets when product ABC is searched. The API pulls
English tweets since the language given is 'en'. It excludes retweets.
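
Once the tweets are returned, they usually need to be turned into plain records before analysis. The sketch below shows one way to do that, assuming each result exposes id and, in extended mode, full_text, as Tweepy status objects do. The helper tweets_to_rows and the stand-in sample objects are illustrative only and are not real API output.

```python
from types import SimpleNamespace

def tweets_to_rows(tweets):
    """Convert tweet-like objects into plain dictionaries."""
    rows = []
    for tweet in tweets:
        # With tweet_mode='extended' the untruncated text is in full_text;
        # fall back to text for the default mode
        text = getattr(tweet, "full_text", None) or getattr(tweet, "text", "")
        rows.append({"id": tweet.id, "text": text})
    return rows

# Stand-in objects that mimic the attributes used above (not real API output)
sample = [
    SimpleNamespace(id=1, full_text="ABC has a great camera"),
    SimpleNamespace(id=2, text="Battery life on ABC could be better"),
]
for row in tweets_to_rows(sample):
    print(row)
```

The resulting list of dictionaries can be passed directly to pd.DataFrame for further processing.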

Recipe 1-2. Collecting Data from PDFs


Much of your data may be stored in PDF files. You need to extract text from these
files and store it for further analysis.

Problem
You want to read a PDF file.


Solution
The simplest way to read a PDF file is by using the PyPDF2 library.

How It Works
Follow the steps in this section to extract data from PDF files.

Step 2-1. Install and import all the necessary libraries


Here are the first lines of code.

!pip install PyPDF2

import PyPDF2
from PyPDF2 import PdfReader

Note You can download any PDF file from the web and place it in the location
where you are running this Jupyter notebook or Python script.

Step 2-2. Extract text from a PDF file


Now let’s extract the text.

#creating a pdf file object

pdf = open("file.pdf","rb")

#creating pdf reader object
#(PyPDF2 3.x names are used below; older releases called these
#PdfFileReader, numPages, getPage, and extractText)

pdf_reader = PyPDF2.PdfReader(pdf)

#checking number of pages in a pdf file

print(len(pdf_reader.pages))

#creating a page object

page = pdf_reader.pages[0]

#finally extracting text from the page

print(page.extract_text())

#closing the pdf file

pdf.close()

Please note that the function doesn’t work for scanned PDFs.
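
The recipe reads only the first page. As a minimal sketch of how the same idea extends to a whole document, the helper below loops over every page of a reader object. It assumes the reader exposes a pages sequence whose items have an extract_text() method, which is the naming used by recent PyPDF2 releases; extract_all_text, FakePage, and FakeReader are hypothetical names used here so the example runs without a real PDF file.

```python
def extract_all_text(reader):
    """Join the text of every page exposed by a PyPDF2-style reader."""
    parts = []
    for page in reader.pages:
        # Treat a page with no extractable text as an empty string
        parts.append(page.extract_text() or "")
    return "\n".join(parts)

# Stand-in classes used only to show the behavior without a real PDF file
class FakePage:
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text

class FakeReader:
    pages = [FakePage("first page"), FakePage(""), FakePage("third page")]

print(extract_all_text(FakeReader()))
```

With a real file, the same function would be called on the reader object created from the opened PDF.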

Recipe 1-3. Collecting Data from Word Files


Next, let’s look at another small recipe that reads Word files in Python.

Problem
You want to read Word files.

Solution
The simplest way is to use the python-docx library, which is imported as docx.

How It Works
Follow the steps in this section to extract data from a Word file.

Step 3-1. Install and import all the necessary libraries


The following is the code to install and import the docx library.

#Install python-docx (the package that provides the docx module)
!pip install python-docx

#Import library
from docx import Document

Note You can download any Word file from the web and place it in the location
where you are running a Jupyter notebook or Python script.


Step 3-2. Extract text from a Word file


Now let’s get the text.

#Creating a word file object

doc = open("file.docx","rb")

#creating word reader object

document = Document(doc)

#Create an empty string called docu. This variable
#stores the text of each paragraph in the Word document.
#We then create a "for" loop that goes through each paragraph in the Word
#document and appends the paragraph text.

docu=""
for para in document.paragraphs:
    docu += para.text

#to see the output call docu


print(docu)
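
The loop above appends paragraphs with no separator, so paragraph boundaries are lost in the resulting string. A minimal sketch of an alternative that keeps them is shown below. It assumes only that the document object has a paragraphs sequence whose items have a text attribute, as in the recipe; paragraphs_to_text and fake_document are hypothetical names, and the stand-in object replaces a real Word file so the example runs on its own.

```python
from types import SimpleNamespace

def paragraphs_to_text(document):
    """Join non-empty paragraph texts with newlines."""
    lines = [para.text for para in document.paragraphs if para.text.strip()]
    return "\n".join(lines)

# Stand-in object that mimics the .paragraphs/.text attributes (illustration only)
fake_document = SimpleNamespace(
    paragraphs=[
        SimpleNamespace(text="First paragraph."),
        SimpleNamespace(text="   "),
        SimpleNamespace(text="Second paragraph."),
    ]
)
print(paragraphs_to_text(fake_document))
```

Skipping whitespace-only paragraphs is a design choice for this sketch; drop the condition if empty lines matter for your analysis.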

Recipe 1-4. Collecting Data from JSON


JSON is an open standard file format that stands for JavaScript Object Notation. It’s often
used when data is sent to a webpage from a server. This recipe explains how to read a
JSON file/object.

Problem
You want to read a JSON file/object.

Solution
The simplest way is to use the requests and json libraries.


How It Works
Follow the steps in this section to extract data from JSON.

Step 4-1. Install and import all the necessary libraries


Here is the code for importing the libraries.

import requests
import json

Step 4-2. Extract text from a JSON file


Now let’s extract the text.

#extracting the text from "https://quotes.rest/qod.json"


r = requests.get("https://quotes.rest/qod.json")
res = r.json()
print(json.dumps(res, indent = 4))

#output
{
    "success": {
        "total": 1
    },
    "contents": {
        "quotes": [
            {
                "quote": "Where there is ruin, there is hope for a treasure.",
                "length": "50",
                "author": "Rumi",
                "tags": [
                    "failure",
                    "inspire",
                    "learning-from-failure"
                ],
                "category": "inspire",
                "date": "2018-09-29",

                "permalink": "https://theysaidso.com/quote/dPKsui4sQnQqgMnXHLKtfweF/rumi-where-there-is-ruin-there-is-hope-for-a-treasure",
                "title": "Inspiring Quote of the day",
                "background": "https://theysaidso.com/img/bgs/man_on_the_mountain.jpg",
                "id": "dPKsui4sQnQqgMnXHLKtfweF"
            }
        ],
        "copyright": "2017-19 theysaidso.com"
    }
}

#extract contents
q = res['contents']['quotes'][0]
q

#output

{'author': 'Rumi',
'background': 'https://theysaidso.com/img/bgs/man_on_the_mountain.jpg',
'category': 'inspire',
'date': '2018-09-29',
'id': 'dPKsui4sQnQqgMnXHLKtfweF',
'length': '50',
'permalink': 'https://theysaidso.com/quote/dPKsui4sQnQqgMnXHLKtfweF/rumi-where-there-is-ruin-there-is-hope-for-a-treasure',
'quote': 'Where there is ruin, there is hope for a treasure.',
'tags': ['failure', 'inspire', 'learning-from-failure'],
'title': 'Inspiring Quote of the day'}

#extract only quote


print(q['quote'], '\n--', q['author'])

#output
Where there is ruin, there is hope for a treasure.
-- Rumi
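The quote service returns a different quote each day and may not always be reachable, so it can help to practice the same extraction on a local string first. The following is a minimal sketch, assuming a made-up payload that only mirrors the structure of the response shown above; the variable names (payload, quotes, first, line) are illustrative and are not part of the recipe.

```python
import json

# Hypothetical payload that mirrors the structure of the response shown above
payload = '''
{
    "success": {"total": 1},
    "contents": {
        "quotes": [
            {
                "quote": "Where there is ruin, there is hope for a treasure.",
                "author": "Rumi",
                "tags": ["failure", "inspire", "learning-from-failure"]
            }
        ]
    }
}
'''

# json.loads parses a JSON string into Python dictionaries and lists
res = json.loads(payload)

# .get with a default avoids a KeyError when a key is missing
quotes = res.get("contents", {}).get("quotes", [])
first = quotes[0] if quotes else {}

line = "{} -- {}".format(first.get("quote", ""), first.get("author", "unknown"))
print(line)
```

Using .get with defaults is a design choice for responses whose shape you do not control; plain indexing, as in the recipe, is fine when the structure is guaranteed.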


Recipe 1-5. Collecting Data from HTML


HTML is short for HyperText Markup Language. It structures webpages and displays
them in a browser. There are various HTML tags that build the content. This recipe looks
at reading HTML pages.

Problem
You want to parse/read HTML pages.

Solution
The simplest way is to use the bs4 library.

How It Works
Follow the steps in this section to extract data from the web.

Step 5-1. Install and import all the necessary libraries


First, import the libraries.

!pip install bs4


import urllib.request as urllib2
from bs4 import BeautifulSoup

Step 5-2. Fetch the HTML file


You can pick any website that you want to extract. Let’s use Wikipedia in this example.

response = urllib2.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
html_doc = response.read()


Step 5-3. Parse the HTML file


Now let’s get the data.

#Parsing
soup = BeautifulSoup(html_doc, 'html.parser')
# Formatting the parsed html file
strhtm = soup.prettify()

# Print the first 1,000 characters

print(strhtm[:1000])

#output

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
  <meta charset="utf-8"/>
  <title>
   Natural language processing - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.
replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonical
Namespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":
0,"wgPageName":"Natural_language_processing","wgTitle":"Natural language
processing","wgCurRevisionId":860741853,"wgRevisionId":860741853,
"wgArticleId":21652,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":
"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Web
archive template wayback links","All accuracy disputes","Articles
with disputed statements from June 2018","Wikipedia articles with
NDL identifiers","Natural language processing","Computational
linguistics","Speech recognition","Computational fields of stud


Step 5-4. Extract a tag value


You can extract a tag’s value from the first instance of the tag using the following code.

print(soup.title)
print(soup.title.string)
print(soup.a.string)
print(soup.b.string)

#output
<title>Natural language processing - Wikipedia</title>
Natural language processing - Wikipedia
None
Natural language processing

Step 5-5. Extract all instances of a particular tag


Here we get all the instances of the tag that we are interested in.

for x in soup.find_all('a'): print(x.string)

#sample output
None
Jump to navigation
Jump to search
Language processing in the brain
None
None
automated online assistant
customer service
[1]
computer science
artificial intelligence
natural language
speech recognition
natural language understanding
natural language generation
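Several of the values printed above are None. In Beautiful Soup, a tag's .string is None whenever the tag contains more than one child, whereas get_text() joins all the text inside the tag. Here is a minimal sketch on a small, made-up HTML snippet (the markup and variable names are assumptions for illustration, not taken from the Wikipedia page).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first link contains a nested <b> tag plus plain text
html = '<p><a href="#x"><b>NLP</b> overview</a> and <a href="#y">regex</a></p>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.a
first_string = first.string    # None, because the tag has two children
first_text = first.get_text()  # all the text inside the tag, joined together

# Collect the target and the visible text of every link
links = [(a.get('href'), a.get_text()) for a in soup.find_all('a')]

print(first_string)
print(first_text)
print(links)
```

If you only need readable text, get_text() is usually the safer choice; .string is convenient when you know the tag holds a single piece of text.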


Step 5-6. Extract all text from a particular tag


Finally, we get the text.

for x in soup.find_all('p'): print(x.text)

#sample output
Natural language processing (NLP) is an area of computer science and
artificial intelligence concerned with the interactions between computers
and human (natural) languages, in particular how to program computers to
process and analyze large amounts of natural language data.

Challenges in natural language processing frequently involve speech
recognition, natural language understanding, and natural language
generation.

The history of natural language processing generally started in the 1950s,
although work can be found from earlier periods.
In 1950, Alan Turing published an article titled "Intelligence" which
proposed what is now called the Turing test as a criterion of intelligence.

Note that the p tag extracted most of the text on the page.

Recipe 1-6. Parsing Text Using Regular Expressions


This recipe discusses how regular expressions help when dealing with text
data. They are especially useful for raw data from the web, which often
contains HTML tags, long text, and repeated text that you need neither while
developing your application nor in its output.
You can do all sorts of basic and advanced data cleaning using regular expressions.

Problem
You want to parse text data using regular expressions.

Solution
The best way is to use the re library in Python.


How It Works
Let’s look at some of the ways we can use regular expressions for our tasks.
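The basic flags are I, L, M, S, U, X.
• re.I ignores case.
• re.L makes matching locale dependent.
• re.M lets ^ and $ match at the start and end of every line, so patterns can be found across multiple lines.
• re.S makes the dot (.) match any character, including a newline.
• re.U enables Unicode matching (the default for str patterns in Python 3).
• re.X lets you write the regex in a more readable format, with whitespace and comments.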
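To make the flags concrete, here is a minimal sketch that applies four of them to a short, made-up string. The sample text and variable names are assumptions for illustration only.

```python
import re

text = "First line\nsecond LINE\nthird line"

# re.I: match regardless of case
ignore_case = re.findall(r"line", text, flags=re.I)

# re.M: ^ matches at the start of every line, not only at the start of the string
first_words = re.findall(r"^\w+", text, flags=re.M)

# re.S: the dot also matches newlines, so the pattern can span lines
without_s = re.search(r"First.*third", text)
with_s = re.search(r"First.*third", text, flags=re.S)

# re.X: whitespace and comments inside the pattern are ignored
word_before_line = re.compile(r"""
    (\w+)   # capture a word
    \s+     # followed by whitespace
    line    # and then the literal text line
""", re.X | re.I)
captured = word_before_line.findall(text)

print(ignore_case)
print(first_words)
print(without_s, with_s is not None)
print(captured)
```

Flags can be combined with the | operator, as in the last pattern.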

The following describes regular expressions’ functionalities.

• Find a single occurrence of either character a or b: [ab]
• Find any character except a and b: [^ab]
• Find a character in the range a to z: [a-z]
• Find a character outside the range a to z: [^a-z]
• Find a character in either range, a to z or A to Z: [a-zA-Z]
• Find any single character (except a newline): .
• Find any whitespace character: \s
• Find any non-whitespace character: \S
• Find any digit: \d
• Find any non-digit: \D
• Find any non-word character: \W
• Find any word character: \w
• Find either a or b: (a|b)
• The occurrence of a is either zero or one: a? ; ? matches zero or one occurrence
• The occurrence of a is zero or more times: a* ; * matches zero or more occurrences
• The occurrence of a is one or more times: a+ ; + matches one or more occurrences
• Match exactly three consecutive occurrences of a: a{3}
• Match three or more consecutive occurrences of a: a{3,}
• Match three to six consecutive occurrences of a: a{3,6}
• Start of a string: ^
• End of a string: $
• Match word boundary: \b
• Non-word boundary: \B
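A few of the constructs listed above can be tried directly with re.findall. The following is a minimal sketch on made-up strings; the sample text and variable names are assumptions for illustration.

```python
import re

sample = "Order 66: aaa, aaaaa and ab-12 shipped to B4."

# \d+ : one or more digits
numbers = re.findall(r"\d+", sample)

# a{3,} : three or more consecutive occurrences of a
long_runs = re.findall(r"a{3,}", sample)

# a{3} : exactly three consecutive occurrences of a
triples = re.findall(r"a{3}", sample)

# \b[A-Z]\w* : a word that starts with an uppercase letter
capitalized = re.findall(r"\b[A-Z]\w*", sample)

# u? : the letter u is optional
spellings = re.findall(r"colou?r", "color colour colr")

print(numbers, long_runs, triples, capitalized, spellings)
```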

The re.match() and re.search() functions find patterns, which are then processed
according to the requirements of the application.
Let’s look at the differences between re.match() and re.search().

• re.match() checks for a match only at the beginning of the string. So,
if it finds the pattern at the beginning of the input string, it returns a
match object; otherwise, it returns None.

• re.search() checks for a match anywhere in the string. It returns a match
object for the first occurrence of the pattern, or None if there is none;
to get all the occurrences, use re.findall().
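The difference is easiest to see side by side. Here is a minimal sketch on a made-up string (the text and variable names are assumptions for illustration); re.findall is included to show how to collect every occurrence.

```python
import re

text = "NLP recipes for NLP tasks"

# re.match only looks at the beginning of the string
at_start = re.match(r"NLP", text)
not_at_start = re.match(r"recipes", text)

# re.search scans the string and stops at the first occurrence
first_hit = re.search(r"recipes", text)

# re.findall returns every non-overlapping occurrence as a list
all_hits = re.findall(r"NLP", text)

print(at_start.group())
print(not_at_start)
print(first_hit.span())
print(all_hits)
```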

Now let’s look at a few examples using these regular expressions.

Tokenizing
Tokenizing means splitting a sentence into words. One way to do this is to use re.split.

# Import library

import re

#run the split query

re.split(r'\s+','I like this book.')

#output
['I', 'like', 'this', 'book.']

For an explanation of regex, please refer to the main recipe.
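Note that splitting on whitespace leaves punctuation attached to the last word ('book.'). If you want punctuation as separate tokens, one option is re.findall with an alternation. This is a minimal sketch of that alternative, not the recipe's own method; the variable names are illustrative.

```python
import re

sentence = 'I like this book.'

# Splitting on runs of whitespace keeps the period attached to the last word
by_whitespace = re.split(r'\s+', sentence)

# \w+ matches a word; [^\w\s] matches a single punctuation character
words_and_punct = re.findall(r'\w+|[^\w\s]', sentence)

print(by_whitespace)
print(words_and_punct)
```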


Extracting Email IDs


The simplest way to extract email IDs is to use re.findall.

1. Read/create the document or sentences.

doc = "For more details please mail us at: xyz@abc.com, pqr@mno.com"

2. Execute the re.findall function.

addresses = re.findall(r'[\w\.-]+@[\w\.-]+', doc)


for address in addresses:
    print(address)

#Output
xyz@abc.com
pqr@mno.com

Replacing Email IDs


Let’s replace email IDs in sentences or documents with other email IDs. The simplest
way to do this is by using re.sub.

1. Read/create the document or sentences.

doc = "For more details please mail us at xyz@abc.com"

2. Execute the re.sub function.

new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)',
r'pqr@mno.com', doc)
print(new_email_address)

#Output
For more details please mail us at pqr@mno.com

For an explanation of regex, please refer to Recipe 1-6.


In both examples, we used a very basic regex for email IDs: it simply states
that runs of word characters, dots, and hyphens separated by @ form an email
ID. However, there are many edge cases. For example, the part before the @ can
also contain digits and the + (plus sign), and the domain name must contain
at least one dot (.).


The following is an advanced regex to extract/find/replace email IDs.

([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)

There are even more complex ones to handle all the edge cases (e.g., “.co.in” email
IDs). Please give it a try.
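Here is a minimal sketch that applies the advanced regex above to both tasks, extraction and replacement. The sample sentence, the email IDs in it, and the placeholder text are made up for illustration.

```python
import re

# The advanced pattern from above, compiled once for reuse
EMAIL = re.compile(r"[a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+")

doc = "Write to first.last+news@example.co.in or admin@mail.example.org today."

# Extract every email ID
found = EMAIL.findall(doc)

# Replace every email ID with a placeholder
masked = EMAIL.sub("<email>", doc)

print(found)
print(masked)
```

Compiling the pattern is optional; it mainly keeps the long expression in one place when it is used more than once.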

Extracting Data from an eBook and Performing regex


Let’s solve a case study that extracts data from an ebook by using the techniques you
have learned so far.

1. Extract the content from the book.

# Import library

import re
import requests

#url you want to extract

url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

#function to extract
def get_book(url):
    # Sends a http request to get the text from project Gutenberg
    raw = requests.get(url).text
    # Discards the metadata from the beginning of the book
    start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*", raw).end()
    # Stops at the first occurrence of "II", so only the opening part is kept
    stop = re.search(r"II", raw).start()
    # Keeps the relevant text
    text = raw[start:stop]
    return text

# processing
def preprocess(sentence):
    # Replaces every run of characters other than letters, digits, and dots
    # with a single space, then lowercases the result
    return re.sub('[^A-Za-z0-9.]+' , ' ', sentence).lower()

#calling the above function
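To see what the slicing and cleaning steps do without downloading the book, here is a minimal sketch that applies the same logic to a short, made-up string. The sample text and the variable names (sample, body, cleaned) are assumptions for illustration and are not part of the original listing.

```python
import re

def preprocess(sentence):
    # Same cleaning rule as above: keep letters, digits, and dots only
    return re.sub('[^A-Za-z0-9.]+', ' ', sentence).lower()

# Hypothetical stand-in for the downloaded text
sample = ('*** START OF THIS PROJECT GUTENBERG EBOOK DEMO ***\n'
          'Part I\n'
          'He said: "Hello, World!"\n'
          'II\n'
          'More text')

# Skip everything up to and including the start marker
start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*", sample).end()
# Stop at the first occurrence of "II"
stop = re.search(r"II", sample).start()

body = sample[start:stop]
cleaned = preprocess(body)

print(cleaned.strip())
```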

Exploring the Variety of Random
Documents with Different Content
Corrosion is a disease and defect of the teeth when they become carious and
hollow, which most often happens in the molars, especially if one does not clean
them of the adhering food which becomes moist and consequently produces bad,
sharp [acid] moisture that eats and corrodes them, always gradually increasing,
until it spoils the teeth entirely, which afterward must fall away in pieces not
without pains.
“Mesue ut supra capite proprio.” This, as Mesue writes, is chiefly cured and
removed in three ways. First, by purging as treated upon above. Second, by
dissolving the material which renders them hollow and eats them away; also by
boiling cockles that grow in barley or wheat, in vinegar and holding this in the
mouth. In this vinegar the root of caper and ginger and other similar remedies
must have been previously boiled. Third, by removing the decay, which is done in
two ways. First, by scraping and cleaning the hole and the carious part with a fine
chisel, knife, or file, or other suitable instrument, as is well known to practitioners,
and then by filling the cavity with gold leaves for the preservation of the other
portion of the tooth. Second, by using suitable medicine, such as oak apples or
wild galls, with which the tooth is filled after having been cleaned.

The following editions of Zahnarzneybuchlein, besides the Basle and Mayence
editions noted by Dr. Guerini at page 166, were issued and copies thereof are
preserved in the libraries of the several collectors as stated. Edition of 1530,
printed by Michael Blum, Leipzig, in collection of Edward C. Kirk. Edition of 1536,
printed by Chr. Egenolff, Frankfurt a/M, in collection of William H. Trueman. Edition
of 1541, printed by Chr. Egenolff, Frankfurt a/M, in Dental Cosmos library and
collection of E. Sauvez. Edition of 1576, printed by Chr. Egenolffserben, in
collection of H. E. Friesell.—E. C. K.]
The book, therefore, lacks importance from a dental point of view,
except in the sense that it shows how little skilled in the cure of
dental affections were the German surgeons of those days.
It is worthy of note that this author, also, speaks of anesthetic
inhalations; he, however, only translates, almost to a word, what
Guy de Chauliac says on this subject.
Toward the end of the fifteenth century and in the first half of the
sixteenth there were published in German, by anonymous authors,
some short translations and compilations on dental subjects, taken
especially from Greek and Arabian authors.275 Of these writings, the
first one known, taken from Galen and Abulcasis, was printed at
Basle in 1490; and another—one of the best—saw the light at
Mayence in 1532. These works were perhaps due to intelligent
barbers, or perhaps—and this seems to be the most probable—they
were written, through the initiative of enterprising printers, by
doctors and surgeons, who wished to remain unknown, on account
of the special subject treated; for, owing to the fact that the diseases
of the dental system were generally left in the hands of barbers and
other unprofessional persons, the doctors and surgeons of those
days would have been ashamed to interest themselves in such
things.
Walter Hermann Ryff, of Strasburg, was born in the beginning of the
sixteenth century, and died about 1570. He was a rather mediocre
doctor and surgeon, and a man of the worst morals, so much so that
many cities expelled him from their midst.276 He wrote many
medical works, in which, however, there is very little original matter.
Their principal merit consists, perhaps, in the fact that they were
written not in Latin, as then was universally customary, but rather in
the vernacular of the author and in a popular style; so that Ryff may
be looked upon as the first who endeavored to diffuse among the
people useful medical and hygienic knowledge.
Among Ryff’s books there are two which are very important to us.
One is his Major Surgery, and the other is a pamphlet entitled Useful
Instruction on the Way to Keep Healthy, to Strengthen and
Reinvigorate the Eyes and the Sight. With Further Instruction on the
Way of Keeping the Mouth Fresh, the Teeth Clean, and the Gums
Firm.277
Of these books, there now only exist some extremely rare copies; so
much so that neither Albert von Haller nor Kurt and Wilhelm
Sprengel, who rendered such great services to the history of surgery,
ever had the pleasure of examining them. Dr. Geist-Jacobi has been
more fortunate than they, and has therefore been able to give us
some very interesting information about their contents.
The Major Surgery is a mere compilation which does not contain
anything new of importance. It was published in part in 1545, and in
part in 1572, after the death of the author. The work is illustrated
with very beautiful wood engravings; and it is just this which gives
the principal value to this book. Some of the illustrations contained
in the first part of it—that is, in that published in 1545—represent
dental instruments, notwithstanding dental surgery is not treated in
this part of the book. The author gives notice that he will treat all
that concerns dental affections in the latter part of this book, in a
special chapter. Unfortunately, this chapter was never written,
because death prevented Ryff from completing the second part of
his work.
Fig. 59. Pelican and dental forceps (Walter Hermann Ryff).
The dental instruments represented in his Major Surgery are many in
number. Among them, first of all, are found the fourteen dental
scrapers of Abulcasis, then the “duck-bill”—designed for the
extraction of dental roots and broken teeth—various kinds of pelican
(Fig. 59 A), the “common dental forceps” (Fig. 59 B), the “goat’s
foot,” and many other kinds of elevators, among which, observes
Geist-Jacobi, may be seen instruments even now in use, and even
some which are said to have been recently invented.
Ryff’s other book is especially noteworthy because, as we have
already mentioned, it treats, for the first time, of dental matters,
independently of general medicine and surgery. This pamphlet,
printed at Würzburg about the year 1544, is made up of sixty-one
pages, and is divided into three parts, the first of which is dedicated
to the eyes, the second to the teeth, and the third to the first
dentition. It is written in popular style, and the author certainly
intended it for the instruction of the public, and not for professional
men; so true is this, that in it he does not speak of the technical part
of the extraction of teeth, or of gold filling—a method already known
for a long time—or of dental prosthesis.
The first part, relative to diseases of the eyes and the manner of
curing them, has no importance for us. The second part begins with
the following paragraph:
“The eyes and the teeth have an extraordinary affinity or reciprocal
relation to one another, by which they very easily communicate to
each other their defects and diseases, so that the one cannot be
perfectly healthy without the other being so too.”278
This last statement is absolutely false, as a disease of the eyes may
very well exist with a perfect condition of the teeth, and vice versa.
However, Ryff has the merit of being, perhaps, the first who has
noted the undeniable relation which exists between the dental and
ocular affections.
After a rapid glance at the anatomy and physiology of the teeth, the
author enumerates the causes of dental disease, which, according to
him, are principally heat, cold, the gathering of humors, and
traumatic actions.
The prophylaxis of dental diseases is beyond any doubt one of the
best parts of the book; however, the ten rules counselled by Ryff for
keeping the teeth healthy—rules which Dr. Geist-Jacobi has made
known to us in full—are reproduced, almost to a word, from
Giovanni d’Arcoli’s work; therefore, the author has no other merit
than that of having translated them into the vulgar tongue, thus
diffusing the knowledge of useful precepts for preventing dental
diseases. We refrain from reproducing the aforesaid rules here, as
they are, with slight variations, identical with those which we gave
when speaking of Arculanus.
Nor can any credit be given to Ryff for the rules which he gives in
regard to the diagnosis of dental pains, as this part of his work is
also taken wholly from the Italian author just mentioned.
After these diagnostic rules Ryff, continuing to translate from the
book of Giovanni d’Arcoli, adds:
“If the pain comes from the gums, extraction is of no use; if it
comes from the tooth, extraction makes it cease; when, lastly, it is in
the nerve, sometimes extraction removes it, and sometimes it does
not, according as the matter obtains or not a free exit.”
The barbers and tooth-drawers, he says, must well remember this
rule, in order to avoid extracting, thoughtlessly and with no benefit,
sound teeth, since then the pain persists in spite of the operation.
Also, it must be borne in mind that, in case of violent pain, it is
necessary to operate as soon as possible, so that the patient may
not faint or be attacked by the falling sickness, if the pain should be
communicated to the heart or brain.
The idea that violent dental pains could give rise to syncope or to
epilepsy (in regard to which we only observe that even very recent
writers enumerate dental caries among the causes of the so-called
reflex epilepsy) is also found in Giovanni d’Arcoli, who expresses
himself in regard to this in the following terms: “Such very violent
pains are sometimes followed by syncope or epilepsy, through injury
communicated to the heart or brain.”279
“The most atrocious pain,” says Ryff, “is when an apostema ripens in
the root;” literal translation of words written about a century before
by Arculanus: “Fortissima dolor est, qui provenit ab apostemate,
quod in radice dentis maturatur.”
Likewise taken from Arculanus is the observation (already made,
however, by much more ancient writers) that “when the cheeks
swell, toothache ceases.” Arculanus, however, expresses himself in a
less absolute manner, and therefore more corresponding to the
truth, since he says “the pain generally ceases” (secundum plurimum
dolor sedatur).
Even in regard to the therapeutics of dental pains, Ryff does not tell
us anything new. Dr. Geist-Jacobi gives this author the merit of
having made, in regard to the cure of dental pains, a distinction
between cura mendosa (that is, imperfect, palliative, tending simply
to calm the pain) and cura vera (that is, directed against the causes
of the disease). But this very important distinction is also taken from
Arculanus, who in his turn took it from Mesue. In fact, after having
spoken of the general rules relative to the cure of dental diseases,
Giovanni of Arcoli adds: “As to the particular therapy, it is divided
into cura mendosa and cura vera, as may be found in Mesue. And
the cura mendosa is so called because it calms the pain by
abolishing sensibility, not by taking away the cause of it. Such is, for
the sake of example, the cure, consisting in fumigations of henbane,
made to reach the diseased tooth by means of a small tube, adapted
to a funnel.”
The third part of Ryff’s pamphlet has as its title:
“How the pains of the gums should be calmed or mitigated in
suckling infants, so as to promote the cutting of the teeth without
pain.”
This part, as Geist-Jacobi informs us, is very brief, not taking up
more than a page and one-half of print. Neither does it contain
anything of importance. To render the cutting of teeth easier, Ryff
advises that infants should have little wax candles given to them to
chew and the gums anointed with butter, duck’s fat, hare’s brains,
and the like. The tooth of a wolf may be hung around the neck of
the child, so that it may gnaw at it. It is also recommended that the
head of the child should be bathed with an infusion of chamomile.
From what has been said, one may see very clearly that the
aforesaid book is, from the scientific point of view, entirely valueless,
because the best part of it is merely copied from the work of
Giovanni d’Arcoli. However, the author has the indisputable merit of
having endeavored to diffuse the knowledge of useful precepts of
dental hygiene. His book, besides, we repeat, has great historical
value, for from it dates the beginning of odontologic literature,
properly so called.
On this point we believe it is necessary to correct an error into which
Dr. Geist-Jacobi has fallen. At the beginning of his very valuable
article on Walter Hermann Ryff280 he says: “In the fifth century of
the Christian era, the iatrosophist Adamantius of Alexandria
published an exclusively odontalgic work, of which, however, we only
know the title.” The same he repeats in his History of Dental Art (pp.
55 and 56), without, however, giving us any proof of his statement.
“Of the odontologic treatise of Adamantius,” he says, “unfortunately
the title alone is known to us, and even that has reached us
indirectly, that is, by means of Ætius; it is of the following tenor.”
Now, whoever takes the trouble to translate these Greek words will
easily perceive that they do not constitute one title, but two distinct
ones (which even Dr. Geist-Jacobi has had to unite by the
conjunction and). These, however, are nothing more than the titles
of two chapters of the Tetrabiblos of Ætius, as anyone may see for
himself by turning over the pages of this work either in the Greek
original, or in the beautiful Latin translation of Giano Cornario
(Venice, 1553). In this great composition of Ætius dental diseases
are treated of in Chapters XXVII to XXXV of Sermo IV, Tetrabiblos II;
and the two Greek titles above referred to are the titles of Chapters
XXVII and XXXI.
In the translation of Giano Cornario they read as follows:
Cura dentium a calido morbo doloroso affectorum, ex Adamantio
sophista (cure of teeth affected by warm, painful disease, according
to Adamantius the sophist).
Cura dentium a siccitate dolore affectorum, ex Adamantio sophista
(cure of teeth affected by pain from dryness, according to
Adamantius the sophist).
The work of Adamantius, from which Ætius took the contents of the
chapters thus entitled, is lost to us, but we have no reason, and not
even the least indication, for supposing that this work was a treatise
on dental diseases, and not one on general medicine. It is absurd to
consider the above-mentioned titles as belonging to an odontological
monograph, on the one hand, because, admitting for a moment the
existence of such a work, it should have had but one title and not
two, and on the other hand, because it is by no means to be
supposed that a great and wise physician, such as Adamantius
undoubtedly was, should have had the whim to write a book, not on
dental disease or on dental pains in general, but only and exclusively
on dental pains caused by heat or by dryness. What reason would
there have been for not extending the treatment of the subject to
those cases of odontalgia resulting from humidity or from cold, that
is, from causes as common and, according to the ideas of that time,
very frequently associated with one of the first two (as humidity with
heat, and cold with dryness)?
Besides, if the titles of the two chapters spoken of be compared with
those of the others, in which Ætius treats of dental affections, such
analogy will be noticed between the various titles as to make us
consider that they have been formulated by Ætius himself, even
when the contents of these chapters are taken from other writers.
So that the two aforesaid titles not only do not belong to any dental
work, but probably they have never existed, even as simple titles of
chapters, in the medical book of Adamantius, from which the
contents of the two chapters of Ætius above mentioned have been
taken.
In order that every one may easily be convinced that the two titles
made so conspicuous by Dr. Geist-Jacobi have nothing particular
about them, but are, instead, perfectly analogous to the titles of
various other chapters of Ætius, we give here the translation of the
titles of five chapters, all concerning dental maladies, that is, the two
chapters in discussion and other three:
Chapter XXVII: Cure of teeth affected by warm, painful disease,
according to Adamantius the sophist.
Chapter XXIX: Cure of teeth affected with pain from humidity.
Chapter XXXI: Cure of teeth affected by pain from dryness,
according to Adamantius the sophist.
Chapter XXXII: Cure of teeth affected by pain from heat and
humidity.
Chapter XXXIII: Cure of decayed teeth, according to Galen.
It appears very clear, therefore, from the great analogy existing
between the headings of all the above-mentioned chapters, that the
titles referred to by Geist-Jacobi have not at all the historical
importance and significance that he attributes to them, and that the
same have been formulated by Ætius himself. To argue from such
titles that Adamantius was the author of a book on dentistry is not
only inadmissible, for all the reasons already given, but also because
if it were allowable to reason with such lightness, it might also be
stated—by arguing from the title of Chapter XXXIII—that Galen was
the author of a monograph on the treatment of dental caries; a
thing which is absolutely untrue. Consequently, the beginning of
odontologic literature cannot be traced back to Adamantius, but, as
Dr. Geist-Jacobi would have it, to an author much less ancient, that
is, to Walter Hermann Ryff, or, if it is preferred, even to the
anonymous writers of the odontologic compilations which appeared
in Germany at the end of the fifteenth century.
Andreas Vesalius. We must now speak of Andreas Vesalius, an
extraordinary man, who by his genius infused new life into medical
science, and who, although he gave but little attention to dental
matters, yet fully deserves a place of honor in the history of
dentistry; for this, like every other branch of medicine, received
great advantage from his reforming work, which broke down forever
the authority of Galen, thus freeing the minds of medical men from
an enslavement which made every real progress impossible.
Andreas Vesalius was born at Brussels, December 31, 1514. He
studied at Louvain and then at Paris, where at that time great
scientists taught, and among others the celebrated anatomist
Jacques Dubois, generally known by the Latinized name of
Sylvius.281 The latter, a great admirer of Galen, whose anatomical
writings served as texts for his lectures, became jealous of the
young Belgian student, who was his assistant, and who gave
undoubted proofs of great genius, and of extraordinary passion in
anatomical research. Vesalius often defied the greatest dangers in
order to obtain corpses either from the cemetery of the Innocents or
from the scaffold at Montfaucon. He soon surpassed his most
illustrious masters, and at only twenty-five years of age published
splendid anatomical plates, which astonished the learned. He
acquired also great renown as surgeon, and in this capacity he
followed the army of Charles V in one of his wars against France.
After having been professor of anatomy in the celebrated University
of Louvain (Belgium), he was invited by the Venetian Republic to
teach in the University of Padua, which, through him, became the
first anatomical school in Europe. Yielding to the requests of the
magistrates of Bologna and Pisa, he also taught in those famous
universities, before immense audiences.

Andreas Vesalius.
Before Vesalius, Galen’s anatomy had served as the constant basis
for the teaching of this science. Although even from the end of the
fifteenth century dead bodies were dissected in all the principal
universities, the teachers of anatomy always conformed, in their
descriptions, to those of Galen, so that the authority of this master,
held infallible, prevailed even over the reality of facts.
Vesalius, for the first time, dared to unveil and clearly put in
evidence the errors of Galen; but this made him many enemies
among the blind followers and worshippers of that demigod of
medicine. Europe resounded with the invectives that were bestowed
upon Vesalius. Among others, there rose against him Eustachio at
Rome, Dryander at Marburg, Sylvius at Paris, and this last did not
spare any calumny that might degrade his old pupil, who had
become so celebrated. In spite of this, the fame of Vesalius kept on
growing more and more, so much so that Charles V called him to
Madrid, to the post of chief physician of his Court, a place which he
kept under Philip II, also after the abdication of Charles V. The good
fortune of Vesalius, unhappily, was not to be of long duration. In
1564 a Spanish gentleman died, in spite of the care bestowed upon
him by Vesalius, and the illustrious scientist requested from the
family, and with difficulty obtained, the permission to dissect the
body. At the moment in which the thoracic cavity was opened the
heart was seen, or thought to be seen, beating. The matter reached
the ears of the relations of the deceased, and they accused Vesalius,
before the Inquisition, of murder and sacrilege; and he certainly
would not have escaped death except by the intervention of Philip II,
who, to save him, desired that he should go on a pilgrimage to the
Holy Land, as an expiation. On his return, the ship which carried
Vesalius was wrecked, and he was cast on a desert beach of the Isle
of Zante, where, according to the testimony of a Venetian traveller,
he died of hunger on October 15, 1564.
Vesalius left to the world an immortal monument, his splendid
treatise on Anatomy,[282] published by him when only twenty-eight
years of age, and of which, from 1543 to 1725, no fewer than fifteen
editions were issued. The appearance of this work marked the
commencement of a new era. The struggle between the supporters
of Galen and those of Vesalius rendered necessary, on both sides,
active research concerning the structure of the human body, so that
anatomy, the principal basis of scientific medicine, gradually became
more and more perfect, and, as a consequence of this, as well as of
the importance which the direct observation of facts acquired over
the authority of the ancients, there began in all branches of
medicine a continual, ever-increasing progress, which gave and still
gives splendid results, such as would have been impossible under
the dominion of Galenic dogmatism.
In the great work of Vesalius the anatomy of the teeth is
unfortunately treated with much less accuracy than that of the other
parts of the body. However, his description of the dental
apparatus[283] is far more exact than that of Galen, and represents
real progress. The number of the roots of the molar teeth (large and
small) is indicated by Galen in a very vague and inexact manner,
since he says that the ten upper molars have generally three,
sometimes four roots, and that the lower ones have generally two,
and rarely three. Vesalius, having examined the teeth and the
number of their roots in a great number of skulls, was able to be
much more precise. In regard to roots, he makes, for the first time,
a very clear distinction between the premolars next to the canine
(small molars) and the other three, and says that the former in the
upper jaw usually have two roots, and in the lower, one only, whilst
the last three upper molars usually have three roots and the lower
ones two. As everyone sees, these indications are, in the main,
exact.
Other important facts established by Vesalius are as follows:
The canines are, of all the teeth, those which have the longest roots.
The middle upper incisors are larger and broader than the lateral
ones, and their roots are longer. The roots of the last molars are
smaller than those of the two preceding molars. In the penultimate
and antepenultimate molars, more often than in the other teeth, it
sometimes happens that a greater number of roots than usual are
found, it being not very rare to meet with upper molars with four
roots, and lower ones with three. The molars are not always five in
each half jaw; sometimes there are only four, either on each side, or
on one side only, in only one jaw or in both. Such differences
generally depend on the last molar, which does not always appear
externally, remaining sometimes completely hidden in the maxillary
bone, or only just piercing with some of its cusps the thin plate of
bone which covers it; a thing which Vesalius could observe in many
skulls in the cemeteries.
In regard to the last molar, the author speaks of its tardy eruption
and of the violent pains which not unfrequently accompany it. The
doctors, he adds, not recognizing the cause of the pain, to make it
cease have recourse to the extraction of teeth, or else, attributing it
to some defects of the humors, overwhelm the sufferer with pills and
other internal remedies, whereas the best remedy would have been
the scarification of the gums in the region of the last molar and
sometimes the piercing of the osseous plate which covers it.
This curative method, of which no one can fail to recognize the
importance, was tried by Vesalius on himself in his twenty-sixth
year, precisely at the time that he had just begun to write his
great treatise on anatomy.
The existence of the central chamber of the teeth appears to have
been unknown to Galen, as he does not allude to it in the least.
Vesalius was the first to put this most important anatomical fact in
evidence. He expresses an opinion that the central cavity facilitates
the nutrition of the tooth. He says, besides, that when a hole is
produced in a tooth by reason of acrid corrosive humors, the
corrosion, when once the internal cavity is reached, spreads rapidly
and deeply in the tooth, owing to the existence of the said cavity,
and sometimes reaches even the end of the root.
In the chapter in which Vesalius treats of the anatomy of the teeth
(Chapter XI, p. 40), two very well-drawn figures are found, one of
which represents a section of a lower molar, showing the pulp cavity
and its prolongation into the two root canals. The other represents
the upper and lower teeth of the right side, in their reciprocal
positions, and shows very clearly their general shape, the length of
their roots, and the number of these.
The changes which take place in the alveolus after the extraction of
a tooth have not escaped the notice of Vesalius. He says that after
an extraction the walls of the alveolus approach one another, and
the cavity is gradually obliterated.
Aristotle had affirmed that men have a greater number of teeth than
women. Vesalius declares this opinion absolutely false—although,
after Aristotle, it has been repeated by many other ancient writers—
and says that anyone can convince himself that the assertion of
Aristotle is contrary to the truth, as it is possible for everybody to
count his own teeth.
In spite of this, we find the above-mentioned error even in writers
subsequent to Vesalius; for example, in Heurnius (professor at
Leyden toward the end of the sixteenth century), who expresses an
opinion that rarely do women have thirty-two teeth, like men.
We find but little in Vesalius concerning the development of the
teeth. He, indeed, made some observations and researches on this
point, but these, from their insufficiency, led him to quite mistaken
conclusions. The teeth of children, he says, have imperfect, soft,
and, as it were, medullary roots; and the part of the tooth which
appears above the gums is united to the root, so to say, as a mere
appendix, after the fall of which there grows from the root the
permanent tooth. This error arose in the mind of Vesalius from
observing that when children lose their milk teeth, these have the
appearance of a kind of stump, as if the root had actually remained
in the socket. Besides this, he had observed with what facility the
milk teeth fall out; and he here calls to mind that, when about seven
years old, he himself and his companions used to pluck out their
loosened teeth, and especially the incisors, with their fingers, or with
a thread tied around the tooth. The softness of the dental roots in
children, the easy fall of the milk teeth, and the want of the lower
part of the roots in these, must have raised the idea in his mind that
the roots of the milk teeth remained in the socket, and that the
upper part of the temporary teeth, instead of being a continuation of
the root, was joined to this as a simple appendix, and in a very weak
way, as though designed to remain in place for a limited length of
time only.
In Vesalius[284] is found a dental terminology—Latin, Greek, Hebrew,
and Arabic—which affords some interest. The incisors are called in
Latin incisorii, risorii, quaterni, quadrupli; and the two middle incisors
have been denominated by some authors duales. The canines are
called in Greek kynodontes, which means the same as the Latin
canini, dog’s teeth. In Latin they have been also denominated
mordentes, and by some also risorii, a name which by others is
given to the incisors, as we have already seen. The molars have also
been called in Latin maxillares, paxillares, mensales, genuini.[285] But
some authors give this last name only to the last molars, or wisdom
teeth, dentes sensus et sapientiæ et intellectus. These teeth have
also been called serotini (that is, tardy), ætatem complentes (that is,
completing the age, the growth), and also, in barbaric Latin,
cayseles or caysales, negugidi, etc.
In the rebellion against the authority of the ancients, Vesalius had a
predecessor whose name, deservedly famous, may be recorded
here. Paracelsus (born in 1493 at Maria-Einsiedeln, Switzerland), on
being nominated, in 1527, Professor of Medicine and Surgery at
Basle, inaugurated his lectures by burning in the presence of his
audience, who were stunned by such temerity, the writings of Galen
and Avicenna, just as Luther, seven years before, had burnt in the
public square of Wittenberg the papal bulls and decretals. The
sixteenth century, in its exuberance of intellectual life, was
undoubtedly one of the grandest centuries in history; human
thought in that glorious epoch shattered its chains, and declared its
freedom both in matters of science and of religion.
Paracelsus, a man of powerful genius, but not well balanced in mind,
of corrupt morals, and of an unlimited pride, had, notwithstanding
these undeniable defects, the merit of beginning a healthy reform in
the science and practice of medicine, by substituting the study of
nature for the authority of the ancients and by giving a great
importance to chemistry, both for the explanation of organic
phenomena and for the cure of disease.
It is to be lamented that this man of genius did not contribute in any
way to the progress of dentistry. His works have no importance for
us. As a matter of mere curiosity we only record here that Paracelsus
considered the too precocious development of the teeth as a great
anomaly, and regarded as monsters those children who were born
with teeth.[286]

[Figures: Paracelsus; Gian Filippo Ingrassia; Gabriel Fallopius.]
Gian Filippo Ingrassia (1510 to 1580), a distinguished Sicilian
anatomist, was one of the first who spoke of the dental germ. He
says that the existence of the tooth properly so called is preceded by
that of a soft dental substance enclosed in the bone, and which he
considers almost as a secretion of the latter.
Matteo Realdo Colombo, of Cremona, a pupil of Vesalius and his
successor in the professorship of Anatomy at Padua, added but little,
as regards the teeth, to what his master had taught. He combated
the erroneous idea that the teeth were formed in the alveoli shortly
before their eruption. Having dissected the jaws of many fetuses,
and having always observed in them the existence of teeth, he could
affirm with all certainty that the teeth begin to be formed in
intra-uterine life.
Like Vesalius, Realdo Colombo believed that the permanent teeth
were developed from the roots of the milk teeth; and, therefore, he
advised the utmost caution in extracting these, since, if the whole
root were removed, the tooth would not grow again.[287]
Gabriel Fallopius (1523 to 1562), the eminent anatomist of Modena,
also a disciple of Vesalius, carried out accurate and successful
researches in regard to the development of the teeth, and made
them known in his book, Observationes anatomicæ, published at
Venice in 1562, the year in which he died.
His investigations enabled him to show the falsity of the opinion held
by Vesalius, that the permanent teeth are developed from the roots
of the temporary ones. He was, besides, the first who spoke in clear
terms of the dental follicle.
The teeth, says Fallopius,[288] are generated twice over, that is, the
first time in the uterus, after the formation of the jaws, and the
second time in extra-uterine life, before the seventh year. The first
teeth are, at the time of birth, still imperfect, without roots,
completely enclosed in their alveoli, and formed of two different
substances; the part with which they must break their way out is
osseous and hollowed; the deeper part, instead, is soft and humid
and is seen covered with a thin pellicle, a thing which may also be
observed in the feathers of birds when they are still tender. In fact,
the part of the feather which comes out of the skin is hard and
corneous, whilst the part which is embedded in the wings is soft and
humid and has the appearance of coagulated blood or mucus. So
also in the fetal teeth, the part corresponding to the future root
presents itself like coagulated mucus. Little by little this soft
substance hardens and becomes osseous, thus constituting the root
of the tooth.
Fallopius’ reference to the analogy between the development of
teeth and that of feathers was highly important, as a point of
departure for embryological researches which showed clearly the
real nature of teeth, thus destroying the mistaken idea—held by
Galen and many other authors—that these organs were bones.
On coming to speak of the teeth generated in extra-uterine life, that
is of the permanent teeth, Fallopius relates having observed that
they have their origin in the following manner: A membranous
follicle is formed inside the bone furnished with two apices, one
posterior (that is to say, deeper down, more distant from the surface
of the gums), to which is joined a small nerve, a small artery, and a
small vein (cui nervulus, et arteriola, et venula applicantur); the
other anterior (that is more superficial), which terminates in a
filament or small string, like a tail. This string reaches right to the
gum, passing through a very narrow aperture in the bone, by the
side of the tooth which is to be replaced by the new one. Inside
the follicle is formed a special white and tenacious substance, and
from this the tooth itself, which at first is osseous only in the part
nearest the surface, whilst the deeper part is still soft, that is,
formed of the above-mentioned substance. Each tooth comes out
traversing and widening the narrow aperture through which the
“tail” of the follicle passes. The latter breaks, and the tooth comes
out of the gum, bare and hard; and in process of time the formation
of its deeper part is completed.
The author says that his long and laborious researches into the
development of the teeth were carried out with great accuracy, and
he is, therefore, in a position to give as absolute certainties the facts
set forth by him. Indeed, the observations of Fallopius were, for the
most part, confirmed by subsequent research. As to the “tail” of the
dental follicle, it is identical with the iter dentis or gubernaculum
dentis of some authors. Fallopius described it as a simple string, but
later on this prolongation of the dental follicle has been considered,
at least by some, as the narrowest part or neck of the follicle itself,
that is, as a channel through which the tooth passes, widening it, on
its way out, and precisely for this reason it has been called iter
dentis (the way of the tooth) or gubernaculum dentis (helm or guide
of the tooth).
Bartholomeus Eustachius, another great anatomist of the sixteenth
century, occupied himself in the study of teeth with special interest,
and wrote a very valuable monograph on this subject. He was a
native of San Severino, Marche (Italy), and a contemporary of
Vesalius, Ingrassia, Realdo Colombo, and Fallopius; he died in 1574,
after having immortalized his name through many anatomical
discoveries and writings of the highest value.
[Figure: Bartholomeus Eustachius.]
His book on the teeth, Libellus de dentibus, published at Venice in
1563, is the first treatise ever written on the anatomy of teeth, and
represents a noteworthy progress in this branch of study.
In this little book—divided into thirty chapters, forming in all ninety-
five pages—the author treats with great accuracy and in an
admirable manner all that concerns the anatomy, physiology, and
development of the teeth.
Eustachius not only treasured up what ancient authors had written
on this subject, but he himself made very long and patient
researches and observations on men and animals, on living
individuals as well as on corpses, and not only on adult subjects, but
also on children of every age, on stillborn children and on abortive
fetuses.
The macroscopic anatomy of the teeth was brought by him to a high
degree of perfection. Very wonderful, among other things, is the
accuracy with which he studied and specified in several synoptical
tables the number of the roots of molar teeth, and all the variations
occurring not only in their number, but also in their form, length, etc.
In Chapter IV, speaking of the means by which teeth are held in
their sockets, Eustachius mentions in quite explicit terms the
ligaments of the teeth. He begins by saying that the perfect
correspondence between the dental roots and the alveoli, both in
shape and in size, is one of the elements which contribute to the
firmness of the teeth, since the alveolus, being exactly applied, on
all sides, to the root or roots of the tooth, causes the latter, by this
simple fact, to be fixed in a determined position. Also, the nerves
inserted in each single tooth contribute, as was already the opinion
of Galen, to the stability of these organs. “There exist besides”—
Eustachius continues—“very strong ligaments, principally attached to
the roots, by which these latter are tightly connected with the
alveoli” (adsunt præterea vincula fortissima radicibus præcipue
adherentia, quibus præsepiolis arctissime colligantur). Lastly, says
the author, the gums, too, embracing the teeth at their exit from the
alveoli, contribute to their firmness. And here Eustachius notes that
in the joining of the gums to the teeth there is great analogy to that
of the skin with the finger nails; a very proper observation, which
makes us almost suppose that the perspicacious mind of Eustachius
may have guessed the kindred nature of nails and teeth.
In Chapter XV are related the researches made by the author to
ascertain at what period the development of the teeth begins. Here
is a passage of this chapter, almost literally translated:
“Hippocrates, before anyone else, wrote that the first teeth are
formed in the uterus. Wishing to assure myself thereof, I dissected
many abortive fetuses, and by very careful observations I found it to
be true that the teeth have their origin during intra-uterine life.
Wherefore, the opinion of those who consider that the first teeth are
formed from the milk, and those of the second dentition from food
and drink, must be declared entirely false. In fact, by opening both
jaws of a stillborn fetus, one may find, on each side of each jaw, the
incisors, the canine, and three molars, partly mucous and partly
osseous, and already sufficiently large and entirely surrounded by
their alveoli. Then removing, with a skilful hand, the incisors and the
canines, there may be observed a very thin partition only just
ossified; and if this be removed with equal care, an equal number of
incisors and canines, almost mucous and very much smaller, appear,
which, enclosed in special alveoli behind the first, would exactly
correspond in position each with its congener, if in both jaws the
canine were not resting for the greater part on the next incisor so as
almost to hide it.”
As to the molars (by which name also the bicuspids are here meant),
Eustachius says that he found but three on each side, and no trace
whatever of the others. Nevertheless, he considers it quite probable
that the germs of the latter should also exist in the fetus, although
so small as to escape observation. He gives many ingenious reasons
in support of his mode of thinking, and comes to the general
conclusion, that not only the temporary teeth but also the
permanent ones have, all of them, their origin during fetal life; a
false conclusion simply because too general, and which shows once
more how, in biological science, one runs great risk of falling into
error whenever one tries to draw too free deductions from observed
phenomena.
The researches of Fallopius and Eustachius confirm and complete
each other. These two eminent anatomists, who gave great glory to
Italy by their immortal discoveries and works, were the first to shed
a brilliant light upon the development of the teeth, and thus opened
up the way to all subsequent research on odontogeny.
In settling the period in which the formation of the teeth begins,
Fallopius was still more successful than Eustachius. His patient
investigations showed him that the development of the teeth
commences partly in the uterus and partly after birth, which is
perfectly true, as was made clear by later embryological researches.
Fallopius found in each fetal jaw twelve teeth.[289] In this he agrees
perfectly with his contemporary, Eustachius, who, as we have seen a
short while ago, found in fetuses only just born the incisors, the
canines, and three molars for each side of each jaw. Eustachius,
however, observed in the fetus the germs of the permanent incisors
and canines as well, a thing not noted by Fallopius.
It is not to be wondered at that some discrepancy should exist
between the observations of these two eminent anatomists. The
researches of which we are speaking are sufficiently delicate and
difficult; and even much more recent authors are far from agreeing
perfectly as regards the period in which the development of
the teeth begins. Serres, in his Essai sur l’anatomie et la physiologie
des dents (Paris, 1817), sustains the view that in the fetus he has
observed the germs of all the teeth, both temporary and permanent,
while Joseph Linderer (Handbuch der Zahnheilkunde, Berlin, 1842)
says that, although he has followed the preparative method
indicated by Serres, he could never discover in the fetus the germs
of all the teeth. Perhaps, he adds, the time when the development
of the teeth begins varies considerably in individuals, just as we
remark differences in the time of eruption.
In Chapter XVII of his book, Eustachius speaks of the process of
formation of the teeth, which he studied in abortive fetuses, in
stillborn children, in children a few months old, and also in kids.
On dissecting a fetal jaw, there may be found on each side, as we
have already seen, the incisors, the canines, and three molars, still
soft and imperfect, separated from one another by very thin osseous
partitions. Each of these teeth is enclosed within a follicle or little
bag of a grayish white color, rather more mucous and glutinous than
membranous, and in form somewhat like the pod of a vegetable,
with the only difference that it shows an opening at one of the
extremities, from which the tooth somewhat protrudes, as if it were
germinating. The more recent and softer the tooth, the more its
follicle has a mucous appearance and differs from the nature of
membranes. As it does not adhere to the underlying tooth, it is easy
to separate them. As to the tooth, it is at that period of its
development partly osseous and partly mucous, since that part
which later on projects from the gum soon becomes transformed
into a white thin and concave scale, which gives the idea of one of
the little cells of a honeycomb. This scale is harder and more
conspicuous in the incisors, since these, at this stage, are better
formed; the canines are less advanced in development, and the
molars still less; and among these latter, those are less developed
which are more distant from the canines. The deeper part of the
tooth consists of a mucous and tenacious substance, harder,
however, than the substance of the follicle, and of a whitish color
with a tendency to dark red, translucent, and somewhat brilliant.
Thus, says Eustachius, the teeth present themselves in a human
fetus; but he who cannot obtain a human fetus may observe the
same things in a kid.
Although the author does not express himself very explicitly, he
seems to consider the follicle of the tooth substantially identical with
its ligament. “This is at first mucous, but afterward, becoming more
consistent, causes the tooth to adhere to the socket and gum very
firmly, as if it were glued.”
“As the part of the tooth which comes out of the gum projects from
the aperture of the follicle like a gem from its bezel, so—says
Eustachius—some believe that the crown of a temporary tooth is a
mere appendix, and that the follicle comes out of its concavity
through a dividing line which they imagine to exist between this
supposed appendix and the remaining part of the tooth. But
assuredly those who assert such things show that they have studied
the anatomy of the teeth so carelessly that, by this one error, they
make manifest their great ignorance together with their great
temerity.[290] The line which is observed on the tooth on the part
corresponding to the adhesion of the gingival margin and of the
dental ligament is very superficial, and after having scraped it away,
there does not remain any trace of a division. But apart from this
everyone can very easily observe, even in infants, or in kids, that the
tooth when ossified does not present any line of division and that
the still mucous follicle envelops it freely, and may be easily
separated from the tooth; which would not be the case, if the follicle
issued from between the tooth and its supposed appendix.”
Thus, Eustachius declares entirely false the opinion already
expressed by Celsus, that the permanent tooth grows from the root
of the milk tooth. He affirms clearly and decisively that between the
external and the radical part of a milk tooth no real division exists,
and that the ossification of the tooth, beginning from the crown,
proceeds without any interruption right down to the end of the root.
If it were true, says he, that in children only the imaginary epiphysis
or appendix falls, and that the new tooth is substantially represented
by the remaining part of the first, it could never happen, as instead
it often does, that the new tooth appears before the first one falls.
Besides, between the lower part of the first tooth and the upper part
of the second no correspondence exists either in size or shape, as
ought necessarily to be the case if the two parts were joined
together. This is not all; the lower part of the temporary tooth is
perforated, and receives in its interior bloodvessels and nerves,
whilst the upper part of the permanent tooth is quite massive and
imperforated. How, then, could this second tooth transmit
bloodvessels and nerves into the cavity of the first? Again, how could
the continuity of these bloodvessels and nerves with their respective
branches be possible, if an imperforate body, such as the crown of
the permanent tooth, were really interposed?
But what is the use of so many arguments? exclaims Eustachius. To
remove even the slightest doubt and to put an end to any
controversy on such a point, only one fact is sufficient, which is
revealed to us by anatomical dissection, and that is, that the teeth
which appear about the seventh year are not only not united to
those which fall at the same period, but cannot even be in contact
with them, owing to the presence of a thin osseous partition.
In the following chapter[291] Eustachius speaks of the central cavity of
the teeth and of the substance contained in it. In young teeth, he
says, the dental cavity is very large, in proportion to the size of the
tooth. According to some anatomists, the central cavity of a tooth is
coated by a very soft and thin membrane, formed by a tissue of very
small vessels and nerves; and besides, this cavity is filled with
marrow, like hollow bones. The observations of the author, however,
do not agree with these statements. The dental cavity does not
contain any fatty substance analogous to the marrow of bones. As to
the above-mentioned membrane, Eustachius doubts its existence.
The large hollow existing in children’s teeth contains, he says, a
mucous substance, somewhat hard, and very smooth at its surface—
almost like a cuticle—but which has rather the appearance of a
concretion than of a membranous tissue. At any rate, adds
Eustachius, if the substance alluded to is made to dry up in the
shade, it acquires an appearance not unlike that of a membrane. It
is certain, however, that at an early age the substance contained in
the dental cavity does not adhere to the walls of the latter after the
manner of a periosteum, but is found in simple contact with the
same, and can, therefore, be separated from them with the greatest
ease.
As years pass by, the dental cavity becomes narrower and narrower,
because the substance contained inside the tooth gradually becomes
ossified at the surface, adhering to the dental scale previously
formed, in the very same manner as the internal or woody part of a
tree adheres to the bark. Of the two hard substances which make up
a tooth, the outer one is white, tense, and dense, like marble, the
underlying one, instead, is somewhat dark, rough, and less compact.
To observe accurately the above-mentioned facts, the author advises
searching for them, first, in the molar teeth of the ox or the ram,
and then in human teeth, and likewise, first in children or in recently
born animals, and then in adults.
Chapters XIX and XX are, comparatively speaking, of little
importance. In the former the author undertakes especially to
