Sampled String Matching is a hybrid approach to the string matching problem, blending online and offline solutions. Among various sampling methods, Character Distance Sampling (CDS) is one of the fastest and most versatile techniques. In optimal conditions, CDS can achieve speedup of up to 100 times, while requiring minimal additional space — ranging from 10% to as little as 0.8% of the original text size. Furthermore, CDS is adaptable, effectively handling non-classical string matching problems and computing structural properties of strings such as periods and coverages. As with all sampling-based matching algorithms, a verification phase on the original text is necessary after the initial matching on the sampled strings. Often, the computational effort required for this verification phase can be substantial. In this article, we introduce a novel sampling method that tracks the context around each sampled location rather than the distances to those locations. This approach aims to reduce the number of candidate occurrences and the subsequent verification effort. Our experimental results indicate that the proposed method outperforms CDS, particularly for short patterns, achieving a speedup of between 15% and 40%.
Improving Sampled Matching through Character Context Sampling
Simone Faro;Francesco Pio Marino;Stefano Scafiti
2024-01-01
Abstract
Sampled String Matching is a hybrid approach to the string matching problem, blending online and offline solutions. Among various sampling methods, Character Distance Sampling (CDS) is one of the fastest and most versatile techniques. In optimal conditions, CDS can achieve speedup of up to 100 times, while requiring minimal additional space — ranging from 10% to as little as 0.8% of the original text size. Furthermore, CDS is adaptable, effectively handling non-classical string matching problems and computing structural properties of strings such as periods and coverages. As with all sampling-based matching algorithms, a verification phase on the original text is necessary after the initial matching on the sampled strings. Often, the computational effort required for this verification phase can be substantial. In this article, we introduce a novel sampling method that tracks the context around each sampled location rather than the distances to those locations. This approach aims to reduce the number of candidate occurrences and the subsequent verification effort. Our experimental results indicate that the proposed method outperforms CDS, particularly for short patterns, achieving a speedup of between 15% and 40%.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.