IBX-11441: Provided proper index limit for the StringField for search engines#744
IBX-11441: Provided proper index limit for the StringField for search engines#744
StringField for search engines#744Conversation
konradoboza
left a comment
There was a problem hiding this comment.
I wasn't personally able to find any reasoning not to do this that way...

Maybe @alongosz can give some historical explanation if there is any. 😉
StringField for search enginesStringField for search engines
alongosz
left a comment
There was a problem hiding this comment.
Two possibilities come to mind:
- some early Solr version had different limit
- it was done to avoid inflating search engine, as the full text values were supposed to be supported by full text search fields only
I don't see in 2026 any reason to keep that old limit :)
| * @param mixed $string | ||
| */ | ||
| private function extractShortText($string) | ||
| private function extractText($string): string |
There was a problem hiding this comment.
I'm against changing the method name here, because what we are doing is we're extracting the first paragraph - and another "field" (see calling location) contains the whole text.
(that's what strtok does in this context)
| private function extractText($string): string | |
| private function extractShortText($string): string |
or straight up declare what we're doing:
| private function extractText($string): string | |
| private function extractFirstParagraph($string): string |
There was a problem hiding this comment.
Actually, I think the method name is correct, but implementation is not. Imo we should store whole text, not just first paragraph by simply removing new lines. Wdyt @Steveb-p @konradoboza @alongosz?
| ); | ||
|
|
||
| // Enforce Lucene's bytes MAX_TERM_LENGTH to avoid silent indexing failures | ||
| return mb_strcut((string)$value, 0, self::MAX_TERM_LENGTH); |
There was a problem hiding this comment.
Databases when truncating usually emit a warning or straight up fail (depending on how strict they are).
I believe we should do the same if we're gonna do that.
Side note: I was considering if moving this truncation here might a BC break, but I feel it's not.
There was a problem hiding this comment.
The original issue was about preventing fails: ezsystems/ezpublish-kernel#1858 therefore I think logging might be better solution, added.
8453d37 to
10cdaf6
Compare
| calls: | ||
| - [setLogger, ['@logger']] |
There was a problem hiding this comment.
This should be done by either autoconfigure: true or explicit monolog tagging with channel AFAIR.
There was a problem hiding this comment.
correct me if I'm wrong but since RemoteIdentifierMapper and MultipleStringMapper extend StringMapper, they inherit its constructor. Without autowiring in the YAML, the container couldn't inject $logger into those child classes, causing them (and StringMapper itself) to fail silently — removing all string-type mappers from the registry, so I would say we need setLogger here?
|



Description:
Solr and ES string fields are based on the Lucene architecture, therefore bytes limit for both is
32766.In the case of TextBlock it never made sense to set the limit to
255, additionally we didn't have such a limit in TextLine.For QA:
Documentation: