Matching Words
From Erlang Community
Current revision (18:54, 24 September 2006), last edited by Ayrnieu
Problem
You want to select words from a string.
Solution
Determine the defining features of a word for your specific application, then write a regular expression that models this idea.
% Turn the {Start,Length} pairs returned by regexp:matches/2 into substrings.
matches(H,{match,M}) -> matches(H,M,[]).
matches(_,[],Acc) -> lists:reverse(Acc); % Acc is built back to front, so restore input order
matches(H,[{I,L}|T],Acc) ->
    matches(H,T,[lists:sublist(H,I,L)|Acc]).
words(String, Regexp) -> matches(String,regexp:matches(String, Regexp)).
Words_1 = "[^ ]+". % as many non-space characters as possible
Words_2 = "[A-Za-z'-]+". % as many letters, apostrophes, and hyphens as possible
1> words("'alpha-beta gamma theta", Words_1).
["'alpha-beta","gamma","theta"]
2> words("'alpha-beta&or gamma theta", Words_2).
["'alpha-beta","or","gamma","theta"]
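The regexp module used above has since been dropped from Erlang/OTP in favour of the re module, so on a current system the same idea might look like the sketch below. The module name words_re is made up for this example, and returning an empty list when nothing matches is an assumption rather than something the recipe specifies.

```erlang
-module(words_re).
-export([words/2]).

%% Return every non-overlapping match of Regexp in String, in input order.
%% 'global' finds all matches; {capture, first, list} returns only the
%% whole match of each, as a string, so every element of Matches is a
%% one-element list.
words(String, Regexp) ->
    case re:run(String, Regexp, [global, {capture, first, list}]) of
        {match, Matches} -> [Word || [Word] <- Matches];
        nomatch -> []   % assumption: no words is an empty list, not an error
    end.
```

Calling words_re:words("'alpha-beta gamma theta", "[^ ]+") should give the same three words as the first shell example above.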
Discussion
Erlang does not have a built-in definition for words in strings. On the one hand, this is inconvenient, since you have to define your own meaning of "word". On the other hand, it is the correct behavior, since the concept of a word varies significantly between applications, locales, encodings, and input sources.
Even within a single application, deciding what counts as a word takes care. Languages usually pluralize singular nouns, attach possessive modifiers, allow hyphenated word combinations, and so forth. The regular expression you use must reflect the range of words you expect to encounter.
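To make the effect of that choice concrete, here is a small sketch that runs two candidate definitions of "word" over the same text. It uses the re module rather than the older regexp module, and the module name word_defs, the helper split/2, and the sample sentence are all invented for illustration.

```erlang
-module(word_defs).
-export([demo/0]).

%% Hypothetical helper: all matches of Regexp in String, as strings.
split(String, Regexp) ->
    case re:run(String, Regexp, [global, {capture, first, list}]) of
        {match, Matches} -> [Word || [Word] <- Matches];
        nomatch -> []
    end.

%% Apply two definitions of "word" to a sentence containing a
%% possessive and a hyphenated compound.
demo() ->
    Text = "the dog's well-known trick",
    [{letters_only, split(Text, "[A-Za-z]+")},
     {letters_apostrophes_hyphens, split(Text, "[A-Za-z'-]+")}].
```

With letters only, the possessive and the compound are each broken in two ("dog", "s" and "well", "known"); allowing apostrophes and hyphens keeps "dog's" and "well-known" whole. Neither result is wrong in itself, and which one you want depends on the application.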


