Unstructured-IO/unstructured

feat/clean_newline

Open

#2513 opened on Feb 6, 2024

(4 comments) (0 reactions) (0 assignees) HTML (1,232 forks) batch import
enhancement, good first issue

Description

Is your feature request related to a problem? Please describe. A word that is split across a line break often appears in extracted text as a fragment followed by a hyphen and one or more whitespace characters, then the rest of the word. It would be useful to have a function, clean_newline, that joins the two fragments back into a single word.

import re


def clean_newline(text: str, pattern: str = r"(\w+)-\s+(\w+)") -> str:
    """
    The `clean_newline` function removes the hyphen and whitespace between two words in a given text.

    :param text: A string that contains the text to be cleaned
    :type text: str
    :param pattern: A regular expression with two capture groups: the fragment before the hyphen
    and the fragment after the whitespace
    :type pattern: str
    :return: a modified version of the input text where any occurrence of a word followed by a hyphen
    and whitespace, followed by another word, is replaced with just the two words concatenated together.
    """
    return re.sub(pattern, r"\1\2", text)
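
A short usage sketch may help show what the proposed default pattern does and where it might over-match. The function body is repeated so the snippet runs on its own. `NEWLINE_ONLY` is a hypothetical alternative pattern, not part of the proposal above; it assumes a caller may want to join fragments only when an actual newline follows the hyphen.

```python
import re


def clean_newline(text: str, pattern: str = r"(\w+)-\s+(\w+)") -> str:
    # Same behavior as the proposed function: drop the hyphen and whitespace
    # between the two captured word fragments.
    return re.sub(pattern, r"\1\2", text)


# Hypothetical stricter pattern (an assumption, not from the proposal):
# optional non-newline whitespace, then a required newline, then any whitespace.
NEWLINE_ONLY = r"(\w+)-[^\S\n]*\n\s*(\w+)"

# The default pattern joins fragments split by a newline or by plain spaces.
print(clean_newline("The docu-\nment was parti-  tioned."))
# The document was partitioned.

# It also joins a suspended hyphen that was never a line break.
print(clean_newline("pre- and post-processing"))
# preand post-processing

# The stricter pattern leaves that case untouched.
print(clean_newline("pre- and post-processing", NEWLINE_ONLY))
# pre- and post-processing
```

Because `pattern` is a parameter, callers who hit the over-matching case can pass a narrower expression without changing the function itself.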
