Linguee_webscraper/main.py

import requests
from lxml import html
import string

charset = string.ascii_lowercase
url = 'https://www.linguee.fr/francais-anglais/search'
params = {
    'ch': 0
}

MAXIMUM_SUGGESTIONS = 4

entries = set()

def treatSuffixes(prefix):
    #print(prefix)
    for char in charset:
        base = prefix + char
        print(base)
        params['qe'] = base
        text = requests.get(url, params = params).text
        # Pay attention if change `base` elaboration to not allow unwanted folder file writing.
        with open(f'requests/{base}.html', 'w') as requestFile:
            requestFile.write(text)
        tree = html.fromstring(text)
        rows = tree.xpath('//div[@class="main_row"]')
        rowsLen = len(rows)
        assert rowsLen <= MAXIMUM_SUGGESTIONS, f'More than {MAXIMUM_SUGGESTIONS} rows!'
        interestingEntries = True
        for row in rows:
            item = row.xpath('div[@class="main_item"]')[0]
            entry = item.text_content()
            wordType = row.xpath('div[@class="main_wordtype"]')
            if wordType != []:
                if item.attrib['lc'] == 'FR' and not entry in entries:
                    print(len(entries), entry, wordType[0].text_content())
                    entries.add(entry)
                if not base in entry:
                    interestingEntries = False
        if rowsLen == MAXIMUM_SUGGESTIONS and interestingEntries:
            treatSuffixes(base)

treatSuffixes('')
Add `main.py` 2024-04-01 00:58:05 +02:00			`import requests`
			`from lxml import html`
First recursive enumeration try 2024-04-01 01:09:42 +02:00			`import string`
Add `main.py` 2024-04-01 00:58:05 +02:00
First recursive enumeration try 2024-04-01 01:09:42 +02:00			`charset = string.ascii_lowercase`
Add `main.py` 2024-04-01 00:58:05 +02:00			`url = 'https://www.linguee.fr/francais-anglais/search'`
			`params = {`
			`'ch': 0`
			`}`
First recursive enumeration try 2024-04-01 01:09:42 +02:00
			`MAXIMUM_SUGGESTIONS = 4`

			`entries = set()`

			`def treatSuffixes(prefix):`
Look only for French dictionary entries 2024-04-01 01:22:10 +02:00			`#print(prefix)`
First recursive enumeration try 2024-04-01 01:09:42 +02:00			`for char in charset:`
			`base = prefix + char`
Look only for French dictionary entries 2024-04-01 01:22:10 +02:00			`print(base)`
First recursive enumeration try 2024-04-01 01:09:42 +02:00			`params['qe'] = base`
			`text = requests.get(url, params = params).text`
Limit recursion to `base` having a word type having the original `base` in it 2024-04-01 01:41:09 +02:00			# Pay attention if change `base` elaboration to not allow unwanted folder file writing.
			`with open(f'requests/{base}.html', 'w') as requestFile:`
			`requestFile.write(text)`
First recursive enumeration try 2024-04-01 01:09:42 +02:00			`tree = html.fromstring(text)`
Look only for French dictionary entries 2024-04-01 01:22:10 +02:00			`rows = tree.xpath('//div[@class="main_row"]')`
			`rowsLen = len(rows)`
			`assert rowsLen <= MAXIMUM_SUGGESTIONS, f'More than {MAXIMUM_SUGGESTIONS} rows!'`
Limit recursion to `base` having a word type having the original `base` in it 2024-04-01 01:41:09 +02:00			`interestingEntries = True`
Look only for French dictionary entries 2024-04-01 01:22:10 +02:00			`for row in rows:`
			`item = row.xpath('div[@class="main_item"]')[0]`
First recursive enumeration try 2024-04-01 01:09:42 +02:00			`entry = item.text_content()`
Look only for French dictionary entries 2024-04-01 01:22:10 +02:00			`wordType = row.xpath('div[@class="main_wordtype"]')`
Limit recursion to `base` having a word type having the original `base` in it 2024-04-01 01:41:09 +02:00			`if wordType != []:`
			`if item.attrib['lc'] == 'FR' and not entry in entries:`
			`print(len(entries), entry, wordType[0].text_content())`
			`entries.add(entry)`
			`if not base in entry:`
			`interestingEntries = False`
			`if rowsLen == MAXIMUM_SUGGESTIONS and interestingEntries:`
First recursive enumeration try 2024-04-01 01:09:42 +02:00			`treatSuffixes(base)`

			`treatSuffixes('')`