Improving Sampled Matching through Character Context Sampling

IRIS

Sampled String Matching is a hybrid approach to the string matching problem, blending online and offline solutions. Among various sampling methods, Character Distance Sampling (CDS) is one of the fastest and most versatile techniques. In optimal conditions, CDS can achieve speedup of up to 100 times, while requiring minimal additional space — ranging from 10% to as little as 0.8% of the original text size. Furthermore, CDS is adaptable, effectively handling non-classical string matching problems and computing structural properties of strings such as periods and coverages. As with all sampling-based matching algorithms, a verification phase on the original text is necessary after the initial matching on the sampled strings. Often, the computational effort required for this verification phase can be substantial. In this article, we introduce a novel sampling method that tracks the context around each sampled location rather than the distances to those locations. This approach aims to reduce the number of candidate occurrences and the subsequent verification effort. Our experimental results indicate that the proposed method outperforms CDS, particularly for short patterns, achieving a speedup of between 15% and 40%.

Improving Sampled Matching through Character Context Sampling

Simone Faro;Thierry Lecroq;Francesco Pio Marino;Arianna Pavone;Stefano Scafiti

2024-01-01

Abstract

Sampled String Matching is a hybrid approach to the string matching problem, blending online and offline solutions. Among various sampling methods, Character Distance Sampling (CDS) is one of the fastest and most versatile techniques. In optimal conditions, CDS can achieve speedup of up to 100 times, while requiring minimal additional space — ranging from 10% to as little as 0.8% of the original text size. Furthermore, CDS is adaptable, effectively handling non-classical string matching problems and computing structural properties of strings such as periods and coverages. As with all sampling-based matching algorithms, a verification phase on the original text is necessary after the initial matching on the sampled strings. Often, the computational effort required for this verification phase can be substantial. In this article, we introduce a novel sampling method that tracks the context around each sampled location rather than the distances to those locations. This approach aims to reduce the number of candidate occurrences and the subsequent verification effort. Our experimental results indicate that the proposed method outperforms CDS, particularly for short patterns, achieving a speedup of between 15% and 40%.

Scheda breve

Scheda completa

Scheda completa (DC)

Anno

2024

Appare nelle tipologie:

4.1 Contributo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
paper190.pdf accesso aperto Tipologia: Versione Editoriale (PDF) Licenza: Creative commons Dimensione 1.47 MB Formato Adobe PDF Visualizza/Apri	1.47 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.11769/641832

Citazioni

ND

0

ND

social impact