Advanced Queries#
Where Attribute#
The where attribute is the most used attribute to select content by applying different conditions to the available fields. Below there are some code snippets that help building a where statement, and notes that help modify the syntax according to the need.
Range of dates#
It works with the standard SQL syntax. If updates will be collected in the future, avoid using an end-date. Alternative fields are: modification_datetime and ingestion_datetime.
publication_datetime >= '2010-01-01 00:00:00' AND publication_datetime <= '2020-06-30 23:59:59'
Filter by source_codes#
The following clause is useful to select sources by their individual code.
UPPER(source_code) IN ('AASFNE', 'HTACCF', 'NLADLW', 'ADVTSR', 'AFNROL', 'AGEEOL', 'AGEE', 'HNASNI', 'APRS', 'ASXTEX', 'AUSTOL')
In case sources will be selected by their category or source family, a better option is using restrictor_codes. This field is not in the documentation, but the CSE or Integration Team can provide more information like source family codes
REGEXP_CONTAINS(restrictor_codes, r'(?i)(^|,)(jpost|nytf|wp|latm|j)($|,)')
Filter by subject_codes#
REGEXP_CONTAINS(subject_codes, r'(?i)(^|,)(mcat|ccat|ecat|gglobe|ghea|ghnwi|gcns|gpir|gdatap|greest|grisk|gsci|gspace|gtrans)($|,)')
Filtering by the region where the source is headquarted#
REGEXP_CONTAINS(region_of_origin, r'(?i)(aust|spain|italy|usa|uk)')
Filtering by language#
LOWER(language_code) IN ('en', 'es', 'it')
Filtering by company codes#
This is applicable to any company-related fields (about, occur or company_codes and other combinations with identifiers - ISIN, CUSIP…).
REGEXP_CONTAINS(company_codes, r'(?i)(^|,)(agbpet|agip|agphng|agpnme|agzgi|altgaz|bbor|brnene|distrg|eenivm|egapg|enichm|enie|enimnt)($|,)')
In case the interest is to ensure at least one company is tagged (the field is not empty), the expressions looks like this
LENGTH(company_codes) > 2
Filtering for content with at least 1 relevant company
LENGTH(company_codes_about) > 0
Filtering by Industry code#
REGEXP_CONTAINS(industry_codes, r'(?i)(^|,)(i1|i25121|i2567)($|,)')
Filtering by Executive codes#
REGEXP_CONTAINS(LOWER(person_codes), r'(?i)(^|,)(76064380|2349856)($|,)')
Filtering by the region the article is about#
REGEXP_CONTAINS(region_codes, r'(?i)(^|,)(aust|spain|italy|usa|uk)($|,)')
Filtering by terms in full-text (Keyword search)#
REGEXP_CONTAINS(CONCAT(title, ' ', IFNULL(snippet, ''), ' ', IFNULL(body, '')), r'(?i)(^|\b)(economic|economy|regulation|deficit|budget\W+tax|central\W+bank)($|.|\b)')
More examples are available in the Data Selection Samples in the Dow Jones Developer Portal (https://developer.dowjones.com/site/docs/data_selection_samples/index.gsp#)
Building the where statement. Python concatenates the strings when inside the parenthesis. Mind the extra space at the end of each string.