How can I split a text into sentences?

Question

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

My old regular expression works badly:

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

i want to do this, but i want to split wherever there is either a period or a newline — yishairasowsky, Dec 30 '19 at 14:40

score 192 · Accepted Answer · edited Jan 23 '17 at 13:48

192

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates this does it:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print '\n-----\n'.join(tokenizer.tokenize(data))

(I haven't tried it!)

edited Jan 23 '17 at 13:48

Georg

960
2
7
22

answered Jan 01 '11 at 22:27

Ned Batchelder

364,293
75
561
662

Thanks, i hope this library will works with Russian language. – Artyom Jan 01 '11 at 23:10
3

@Artyom: It probably can work with Russian -- see [can NLTK/pyNLTK work “per language” (i.e. non-english), and how?](http://stackoverflow.com/questions/1795410/can-nltk-pynltk-work-per-language-i-e-non-english-and-how). – martineau Jan 02 '11 at 00:28
4

@Artyom: Here's direct link to the online documentation for [`nltk .tokenize.punkt.PunktSentenceTokenizer`](http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt.PunktSentenceTokenizer-class.html). – martineau Jan 02 '11 at 00:32
19

You might have to execute `nltk.download()` first and download models -> `punkt` – Martin Thoma Jan 12 '15 at 18:36
2

to save some typing: `import nltk` then `nltk.sent_tokenize(string)` – Yibo Yang Mar 22 '17 at 02:30
2

This fails on cases with ending quotation marks. If we have a sentence that ends like "this." – Fosa Feb 21 '18 at 05:16
@Fosa I think that's not a valid sentence, the quotation mark shall precede the period. – szedjani Oct 31 '19 at 09:55
2

Okay, you convinced me. But I just tested and it does not seem to fail. My input is `'This fails on cases with ending quotation marks. If we have a sentence that ends like "this." This is another sentence.'` and my output is `['This fails on cases with ending quotation marks.', 'If we have a sentence that ends like "this."', 'This is another sentence.']` Seems correct for me. – szedjani Oct 31 '19 at 10:37
After compiling comments from all people, it still fails to parse the following sentence: "FIG. 1A is a simplified pin out diagram for an integrated circuit which includes a serial peripheral interface I/O according to an embodiment of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims herein." It separates "FIG" and "1A". I have to add a special "if" statement to handle this. if "FIG." in text: text = text.replace("FIG.","FIG") This is very ad hoc, but I am not sure if there's a better way to generalize it. – fanchyna May 06 '20 at 15:41
How to get an array of sentences from the text? – huy Aug 21 '21 at 02:30
This fails on the simple example `4. Building, sculpting, moving, and mending things in hard to reach places and at small scales (e.g. dig tunnels, deliver adhesives to cracks)`. – HappyFace Feb 17 '23 at 10:08
@szedjani https://en.wikipedia.org/wiki/Quotation_marks_in_English#Order_of_punctuation Anyway, we can't control the grammar of the text we're analyzing – endolith May 03 '23 at 01:03

score 154 · Answer 2 · edited May 02 '23 at 14:53

This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds and handles many of the more painful edge cases that make sentence parsing non-trivial e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

# -*- coding: utf-8 -*-
import re
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|edu|me)"
digits = "([0-9])"
multiple_dots = r'\.{2,}'

def split_into_sentences(text: str) -> list[str]:
    """
    Split the text into sentences.

    If the text contains substrings "<prd>" or "<stop>", they would lead 
    to incorrect splitting because they are used as markers for splitting.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list[str]
    """
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub("\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]
    if sentences and not sentences[-1]: sentences = sentences[:-1]
    return sentences

Comparison with `nltk`:

>>> from nltk.tokenize import sent_tokenize

Example 1: split_into_sentences is better here (because it explicitly covers a lot of cases):

>>> text = 'Some sentence. Mr. Holmes...This is a new sentence!And This is another one.. Hi '

>>> split_into_sentences(text)
['Some sentence.',
 'Mr. Holmes...',
 'This is a new sentence!',
 'And This is another one..',
 'Hi']

>>> sent_tokenize(text)
['Some sentence.',
 'Mr.',
 'Holmes...This is a new sentence!And This is another one.. Hi']

Example 2: nltk.tokenize.sent_tokenize is better here (because it uses an ML model):

>>> text = 'The U.S. Drug Enforcement Administration (DEA) says hello. And have a nice day.'

>>> split_into_sentences(text)
['The U.S.',
 'Drug Enforcement Administration (DEA) says hello.',
 'And have a nice day.']

>>> sent_tokenize(text)
['The U.S. Drug Enforcement Administration (DEA) says hello.',
 'And have a nice day.']

This is an awesome solution. However I added two more lines to it digits = "([0-9])" in the declaration of regular expressions and text = re.sub(digits + "[.]" + digits,"\\1\\2",text) in the function. Now it does not split the line at decimals such as 5.5. Thank you for this answer. — Ameya Kulkarni, Jul 17 '16 at 11:12
How did you parse the entire Huckleberry Fin? Where's that in text format? — PascalVKooten, Feb 04 '17 at 10:52
A great solution. In the function, I added if "e.g." in text: text = text.replace("e.g.","eg") if "i.e." in text: text = text.replace("i.e.","ie") and it fully solved my problem. — Sisay Chala, Jun 01 '17 at 08:09
Great solution with very helpful comments! Just to make it a little more robust though: `prefixes = "(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]"`, `websites = "[.](com|net|org|io|gov|me|edu)"`, and `if "..." in text: text = text.replace("...","")` — Dascienz, Jan 26 '18 at 19:02
Can this function be made to see sentences like this as one sentence: When a child asks her mother "Where do babies come from?", what should one reply to her? — twhale, Apr 29 '18 at 06:54
This is super useful for running jobs with pig, where jython doesn't natively come with nltk. However it seems to completely discard non-ascii sentences. — John Jiang, Feb 05 '20 at 01:41
what about decimal numbers? `text = re.sub(" (\d+)[.](\d+) "," \\1\\2 ",text)` — Ahmad, Aug 08 '20 at 15:04
How to include this corner case also: `Thank you for contacting back. Request you to please help us with the transaction ID for $<***>.92 ? - Charlie.` — MAC, Aug 06 '21 at 05:20
Hmm sentence `"The U.S. Drug Enforcement Administration (DEA) says hello. And have a nice day."` is splitting after "U.S." for some reason. — geotheory, Oct 06 '21 at 09:12
Awsome, for some improvement, if the final sentence does not have a dot at the end, it is not included. "A sentence. A second sentence. A third sentence witout a final dot" --> ['A sentence.', 'A second sentence.'] — Xiiryo, Dec 07 '21 at 21:39
@Xiiryo did you solve the case when the last sentence does not have a dot ? — kely789456123, Jan 13 '22 at 08:29
@Xiiryo this case (where the last sentence does not end in a dot) is now solved — Vladimir Fokow, May 02 '23 at 14:14

score 94 · Answer 3 · answered Oct 30 '17 at 13:34

94

Instead of using regex for spliting the text into sentences, you can also use nltk library.

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

ref: https://stackoverflow.com/a/9474645/2877052

answered Oct 30 '17 at 13:34

Hassan Raza

3,025
22
35

Great, simpler and more reusable example than the accepted answer. – Eli O. Aug 08 '19 at 14:49
If you remove a space after a dot, tokenize.sent_tokenize() doesn't work, but tokenizer.tokenize() works! Hmm... – Leonid Ganeline Aug 08 '19 at 21:32
1

`for sentence in tokenize.sent_tokenize(text): print(sentence)` – Victoria Stuart Feb 27 '20 at 19:35
can i limit it to like 2 sentences only? – Sunil Garg Nov 11 '21 at 06:49
2

I found that nltk.tokenize.sent_tokenize gives results with faulty splitting sentences when it finds i.e., e.g. etc. and other abbreviations. – Tedo Vrbanec Mar 18 '22 at 10:12

score 20 · Answer 4 · edited Feb 10 '18 at 19:05

20

You can try using Spacy instead of regex. I use it and it does the job.

import spacy
nlp = spacy.load('en')

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    print(sent.string.strip())

edited Feb 10 '18 at 19:05

hd1

33,938
5
80
91

answered Jan 10 '18 at 12:03

Elf

659
9
19

4

Space is mega great. but if you just need to separate into sentences passing the text to space will take too long if you are dealing with a data pipe – JFerro Jun 19 '19 at 19:22
@Berlines I agree but couldn't find any other library that does the job as clean as spaCy. But if you have any suggestion, I can try. – Elf Aug 16 '19 at 11:19
3

Also for the AWS Lambda Serverless users out there, spacy's support data files are many 100MB (english large is > 400MB) so you can't use things like this out of the box, very sadly (huge fan of Spacy here) – Julian H Jun 16 '20 at 04:12
1

I found spacy very bad splitting my texts into sentences, giving some phantom sentences containing just a dot. – Tedo Vrbanec Mar 18 '22 at 10:27

score 10 · Answer 5 · edited May 23 '17 at 12:18

Here is a middle of the road approach that doesn't rely on any external libraries. I use list comprehension to exclude overlaps between abbreviations and terminators as well as to exclude overlaps between variations on terminations, for example: '.' vs. '."'

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
   end = True
   sentences = []
   while end > -1:
       end = find_sentence_end(paragraph)
       if end > -1:
           sentences.append(paragraph[end:].strip())
           paragraph = paragraph[:end]
   sentences.append(paragraph)
   sentences.reverse()
   return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl's find_all function from this entry: Find all occurrences of a substring in Python

Nice work. One not— “i.e.” translates as “That is”, not “For example”. — Leon Bambrick, May 03 '23 at 06:54

score 9 · Answer 6 · answered Jun 27 '19 at 09:09

You can also use sentence tokenization function in NLTK:

from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."

sent_tokenize(sentence)

I tried to use it because nltk is a very, very good library, but it fails on abbreviations where it splits but it should not. — Tedo Vrbanec, Mar 18 '22 at 10:29

mph · Answer 7 · 2023-05-11T20:58:04.527

9

I love spaCy to death, but I recently discovered two new approaches for sentence tokenization. One is BlingFire from Microsoft (incredibly fast), and the other is PySBD from AI2 (supremely accurate).

text = ...

from blingfire import text_to_sentences
sents = text_to_sentences(text).split('\n')

from pysbd import Segmenter
segmenter = Segmenter(language='en', clean=False)
sents = segmenter.segment(text)

I separated 20k sentences using five different methods. Here are the elapsed times on an AMD Threadripper Linux machine:

spaCy Sentencizer: 1.16934s
spaCy Parse: 25.97063s
PySBD: 9.03505s
NLTK: 0.30512s
BlingFire: 0.07933s

UPDATE: I tried using BlingFire on all-lowercase text, and it failed miserably. I'm going to use PySBD in my projects for the time being.

edited May 11 '23 at 20:58

answered Sep 20 '22 at 18:52

mph

808
1
10
22

1

BlingFire [doesn't work](https://github.com/microsoft/BlingFire/issues/137) on ARM Linux or macOS currently. – HappyFace Feb 17 '23 at 10:36
When testing with a subset of the 51 English "golden rules" for sentence segmentation defined here (https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt, which is from https://github.com/diasks2/pragmatic_segmenter), BlingFire was the most accurate option and only slightly slower than NLTK with nltk.tokenize.sent_tokenize(text). Said subset was relevant for my purposes and included these 33 rules only: https://pastebin.com/raw/xqJATfcX – Spherical Cowboy Apr 11 '23 at 23:07

Rafe Kettler · Answer 8 · 2013-07-26T06:05:37.283

7

For simple cases (where sentences are terminated normally), this should work:

import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The regex is *\. +, which matches a period surrounded by 0 or more spaces to the left and 1 or more to the right (to prevent something like the period in re.split being counted as a change in sentence).

Obviously, not the most robust solution, but it'll do fine in most cases. The only case this won't cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?)

edited Jul 26 '13 at 06:05

answered Jan 01 '11 at 22:34

Rafe Kettler

75,757
21
156
151

39

You can't think of a situation in English where a sentence doesn't end with a period? Imagine that! My response to that would be, "think again." (See what I did there?) – Ned Batchelder Jan 01 '11 at 22:37
@Ned wow, can't believe I was that stupid. I must be drunk or something. – Rafe Kettler Jan 01 '11 at 22:39
I am using Python 2.7.2 on Win 7 x86, and the regex in the above code gives me this error: `SyntaxError: EOL while scanning string literal`, pointing to the closing parenthesis (after `text`). Also, the regex you reference in your text does not exist in your code sample. – Sabuncu Jul 23 '13 at 18:35
1

The regex is not completely correct, as it should be ```r' *[\.\?!][\'"\)\]]* +'``` – fsociety Sep 09 '15 at 20:39
It may cause many problems and chunk a sentence to smaller chunks as well. Consider the case that we have " I paid $3.5 for this ice cream" them the chunks are " I paid $3" and "5 for this ice cream". use the default nltk sentence.tokenizer is safer! – Reihan_amn Feb 23 '18 at 19:19

score 6 · Answer 9 · edited Jan 09 '21 at 14:40

6

Using spacy:

import spacy

nlp = spacy.load('en_core_web_sm')
text = "How are you today? I hope you have a great day"
tokens = nlp(text)
for sent in tokens.sents:
    print(sent.string.strip())

edited Jan 09 '21 at 14:40

mic

4,300
1
19
25

answered Dec 15 '20 at 16:00

Gucci148

1,977
1
13
4

score 3 · Answer 10 · answered Mar 14 '21 at 21:36

Might as well throw this in, since this is the first post that showed up for sentence split by n sentences.

This works with a variable split length, which indicates the sentences that get joined together in the end.

import nltk
//nltk.download('punkt')
from more_itertools import windowed

split_length = 3 // 3 sentences for example 

elements = nltk.tokenize.sent_tokenize(text)
segments = windowed(elements, n=split_length, step=split_length)
text_splits = []
for seg in segments:
          txt = " ".join([t for t in seg if t])
          if len(txt) > 0:
                text_splits.append(txt)

score 3 · Answer 11 · answered Jun 30 '21 at 20:13

3

If NLTK's sent_tokenize is not a thing (e.g. needs a lot of GPU RAM on long text) and regex doesn't work properly across languages, sentence splitter might be try worth.

answered Jun 30 '21 at 20:13

TefoD

177
10

score 3 · Answer 12 · edited Dec 07 '21 at 13:37

3

Using Stanza a natural language processing library that works for many human languages.

import stanza

stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize')

doc = nlp(t_en)
for sentence in doc.sentences:
    print(sentence.text)

edited Dec 07 '21 at 13:37

Paul Brennan

2,638
4
19
26

answered Dec 07 '21 at 08:52

Ram Six

79
6

1

This, is fantastic, though if you are going to use this please use the multilingual model (https://stanfordnlp.github.io/stanza/langid.html) – Muneeb Ahmad Khurram Feb 08 '22 at 18:40

score 1 · Answer 13 · answered May 14 '12 at 01:59

No doubt that NLTK is the most suitable for the purpose. But getting started with NLTK is quite painful (But once you install it - you just reap the rewards)

So here is simple re based code available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html

# split up a paragraph into sentences
# using regular expressions


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split by multile characters

    #   regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print s.strip()

#output:
#   This is a sentence
#   This is an excited sentence

#   And do you think this is a question

Yey but this fails so easily, with: "Mr. Smith knows this is a sentence." — thomas, Feb 11 '14 at 10:15

score 1 · Answer 14 · answered May 13 '20 at 06:10

i hope this will help you on latin,chinese,arabic text

import re

punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|！|？|；|…|　|!|؟|؛)+")
lines = []

with open('myData.txt','r',encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]

Mazeen Muhammed · Answer 15 · 2020-08-07T13:20:41.560

Was working on similar task and came across this query, by following few links and working on few exercises for nltk the below code worked for me like magic.

from nltk.tokenize import sent_tokenize 
  
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text)

output:

['Hello everyone.',
 'Welcome to GeeksforGeeks.',
 'You are studying NLP article']

Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/

score 1 · Answer 16 · answered Oct 23 '20 at 08:38

1

Also, be wary of additional top level domains that aren't included in some of the answers above.

For example .info, .biz, .ru, .online will throw some sentence parsers but aren't included above.

Here's some info on frequency of top level domains: https://www.westhost.com/blog/the-most-popular-top-level-domains-in-2017/

That could be addressed by editing the code above to read:

alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|ai|edu|co.uk|ru|info|biz|online)"

answered Oct 23 '20 at 08:38

cogijl

59
6

This is helpful information, but it might be more appropriate to add it as a short comment on the original answer. – vlz Oct 23 '20 at 13:32
1

That was my original plan, but I don't have the reputation for that yet apparently. Thought this might help someone so I thought I'd post it as best I could. If there's a way to do it and get around the "you need 50 reputation" first, I'd love to :) – cogijl Oct 26 '20 at 00:30

score 1 · Answer 17 · edited Feb 05 '21 at 05:46

You could make a new tokenizer for Russian (and some other languages) using this function:

def russianTokenizer(text):
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ') 
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result

and then call it in this way:

text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)

This badly splits text into words, not sentences. – dchest Apr 05 '22 at 09:28 — dchest, Apr 05 '22 at 09:28

score 0 · Answer 18 · answered Mar 14 '18 at 21:49

I had to read subtitles files and split them into sentences. After pre-processing (like removing time information etc in the .srt files), the variable fullFile contained the full text of the subtitle file. The below crude way neatly split them into sentences. Probably I was lucky that the sentences always ended (correctly) with a space. Try this first and if it has any exceptions, add more checks and balances.

# Very approximate way to split the text into sentences - Break after ? . and !
fullFile = re.sub("(\!|\?|\.) ","\\1<BRK>",fullFile)
sentences = fullFile.split("<BRK>");
sentFile = open("./sentences.out", "w+");
for line in sentences:
    sentFile.write (line);
    sentFile.write ("\n");
sentFile.close;

Oh! well. I now realize that since my content was Spanish, I did not have the issues of dealing with "Mr. Smith" etc. Still, if someone wants a quick and dirty parser...

Inshaf Ahmed · Answer 19 · 2022-03-16T09:50:21.103

0

using spacy

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This is first.This is second.This is Thired ')
for sentence in doc.sents:
  print(sentence)

But if you want to do get a sentence by index Example:

#don't work
 doc.sents[0]

Use

list( doc.sents)[0]

edited Mar 16 '22 at 09:50

answered Feb 20 '22 at 15:19

Inshaf Ahmed

401
6
8

1

It should be "for sentence in doc.sents:". – You_Donut Mar 14 '22 at 15:43

score 0 · Answer 20 · answered Feb 17 '23 at 10:27

0

Using Spacy v3.5:

import spacy

nlp_sentencizer = spacy.blank("en")
nlp_sentencizer.add_pipe("sentencizer")

text = "How are you today? I hope you have a great day"
tokens = nlp_sentencizer(text)
[str(sent) for sent in tokens.sents]

answered Feb 17 '23 at 10:27

HappyFace

3,439
2
24
43

How can I split a text into sentences?

20 Answers20

Comparison with `nltk`:

Linked

Related

How can I split a text into sentences?

20 Answers20

Comparison with nltk:

Linked

Related

Comparison with `nltk`: