Predição de prisões em ocorrências criminais em Chicago¶
Projeto Final: Big Data e Computação em Nuvem
Por: Ilana Garcia, Izabelle Silva, Julia Navarro, Lívia Bertoni
Contexto do projeto¶
Este projeto trabalha com o conjunto de dados de crimes registrados na cidade de Chicago entre 2001 e 2023, disponibilizado pelo Chicago Police Department e organizado na plataforma Kaggle.
Cada linha representa um incidente criminal reportado, contendo informações essenciais para análise temporal, espacial e contextual:
- variáveis temporais (data, hora, ano);
- tipo de crime e descrição detalhada;
- características do local (residência, comércio, escola, via pública etc.);
- indicadores importantes como prisão realizada crime doméstico;
- localização aproximada (quarteirão, distrito policial, área comunitária, latitude e longitude).
Para proteção da privacidade das vítimas, os endereços são aproximados: o nível máximo divulgado é o do quarteirão, e coordenadas são deslocadas.
Visão geral do dataset¶
Abaixo, um resumo das principais variáveis que serão utilizadas na análise:
| Variável | Descrição | Tipo | Exemplo |
|---|---|---|---|
ID |
Identificador único do incidente | Inteiro | 12345678 |
Date |
Data e horário em que o crime foi registrado | Data/hora | 2019-07-15 23:45:00 |
Primary Type |
Categoria principal do crime | Categórica | THEFT, BATTERY, ROBBERY |
Description |
Descrição mais detalhada do tipo de crime | Categórica | OVER $500, SIMPLE, ARMED: HANDGUN |
Location Description |
Tipo de local onde ocorreu o crime | Categórica | STREET, RESIDENCE, SIDEWALK, SCHOOL |
Arrest |
Indica se houve prisão no incidente | Binária | TRUE / FALSE |
Domestic |
Indica se o crime está relacionado à violência doméstica | Binária | TRUE / FALSE |
Beat / District / Ward / Community Area |
Códigos administrativos de região policial e área comunitária | Inteiro / categórico | Beat 111, District 01, Community Area 32 |
Latitude / Longitude |
Coordenadas aproximadas do quarteirão do incidente | Numérico | 41.8781, -87.6298 |
Year |
Ano em que o crime foi registrado | Inteiro | 2008, 2015, 2023 |
Objetivo da análise¶
A análise busca responder principalmente a duas questões:
Como os crimes em Chicago se comportam ao longo do tempo e quais padrões podem ser observados?
- Quais anos apresentam maior volume de ocorrências?
- Como os incidentes variam por dia da semana e horário?
- Há categorias de crime particularmente frequentes?
É possível prever a probabilidade de prisão em um incidente a partir de suas características?
- A partir de atributos como tipo de crime, horário, local e indicadores adicionais, qual a chance de que o registro resulte em prisão?
- Como lidar com o forte desbalanceamento entre incidentes com e sem prisão?
- Quais variáveis parecem contribuir mais para o modelo preditivo?
Configuração do Spark¶
Nesta seção configuramos e inicializamos uma sessão Spark — necessária para executar operações distribuídas de leitura, transformação e modelagem com DataFrames.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("Crimes Data Analysis") \
.getOrCreate()
spark
SparkSession - in-memory
Importação do Dataset (KaggleHub)¶
Para iniciar o projeto, precisamos baixar a base de dados Crimes In Chicago (2001 to 2023) diretamente do Kaggle.
Usamos a biblioteca kagglehub, que permite fazer o download automático da versão mais recente do dataset, garantindo praticidade e reprodutibilidade.
O código abaixo:
- importa o pacote
kagglehub; - baixa o dataset solicitado;
- salva o caminho local onde os arquivos foram armazenados;
- imprime esse caminho para ser usado nas próximas etapas do notebook.
Execute a célula de código logo abaixo:
import kagglehub
# Download latest version
path = kagglehub.dataset_download("utkarshx27/crimes-2001-to-present")
print("Path to dataset files:", path)
/opt/jupyterhub-venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm
Path to dataset files: /tmp/pads/.cache/kagglehub/datasets/utkarshx27/crimes-2001-to-present/versions/1
from pyspark.sql.types import *
schema = StructType([
StructField("ID", IntegerType(), True),
StructField("Case Number", StringType(), True),
StructField("Date", StringType(), True),
StructField("Block", StringType(), True),
StructField("IUCR", StringType(), True),
StructField("Primary Type", StringType(), True),
StructField("Description", StringType(), True),
StructField("Location Description", StringType(), True),
StructField("Arrest", BooleanType(), True),
StructField("Domestic", BooleanType(), True),
StructField("Beat", IntegerType(), True),
StructField("District", IntegerType(), True),
StructField("Ward", IntegerType(), True),
StructField("Community Area", IntegerType(), True),
StructField("FBI Code", StringType(), True),
StructField("X Coordinate", FloatType(), True),
StructField("Y Coordinate", FloatType(), True),
StructField("Year", IntegerType(), True),
StructField("Updated On", StringType(), True),
StructField("Latitude", FloatType(), True),
StructField("Longitude", FloatType(), True),
StructField("Location", StringType(), True)
])
crimes_df = spark.read.csv(path, header=True, schema=schema)
crimes_df.show()
+--------+-----------+--------------------+--------------------+----+------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+---------+----------+--------------------+ | ID|Case Number| Date| Block|IUCR| Primary Type| Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year| Updated On| Latitude| Longitude| Location| +--------+-----------+--------------------+--------------------+----+------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+---------+----------+--------------------+ |10224738| HY411648|09/05/2015 01:30:...| 043XX S WOOD ST|0486| BATTERY|DOMESTIC BATTERY ...| RESIDENCE| false| true| 924| 9| 12| 61| 08B| 1165074.0| 1875917.0|2015|02/10/2018 03:50:...|41.815117| -87.67|(41.815117282, -8...| |10224739| HY411615|09/04/2015 11:30:...| 008XX N CENTRAL AVE|0870| THEFT| POCKET-PICKING| CTA BUS| false| false|1511| 15| 29| 25| 06| 1138875.0| 1904869.0|2015|02/10/2018 03:50:...| 41.89508| -87.7654|(41.895080471, -8...| |11646166| JC213529|09/01/2018 12:01:...|082XX S INGLESIDE...|0810| THEFT| OVER $500| RESIDENCE| false| true| 631| 6| 8| 44| 06| NULL| NULL|2018|04/06/2019 04:04:...| NULL| NULL| NULL| |10224740| HY411595|09/05/2015 12:45:...| 035XX W BARRY AVE|2023| NARCOTICS|POSS: HEROIN(BRN/...| SIDEWALK| true| false|1412| 14| 35| 21| 18| 1152037.0| 1920384.0|2015|02/10/2018 03:50:...|41.937405| -87.71665|(41.937405765, -8...| |10224741| HY411610|09/05/2015 01:00:...| 0000X N LARAMIE AVE|0560| ASSAULT| SIMPLE| APARTMENT| false| true|1522| 15| 28| 25| 08A| 1141706.0| 1900086.0|2015|02/10/2018 03:50:...|41.881905| -87.75512|(41.881903443, -8...| |10224742| HY411435|09/05/2015 10:55:...| 082XX S LOOMIS BLVD|0610| BURGLARY| FORCIBLE ENTRY| RESIDENCE| false| false| 614| 6| 21| 71| 05| 1168430.0| 1850165.0|2015|02/10/2018 03:50:...|41.744377| -87.65843|(41.744378879, -8...| |10224743| HY411629|09/04/2015 06:00:...|021XX W CHURCHILL ST|0620| BURGLARY| UNLAWFUL ENTRY| RESIDENCE-GARAGE| false| false|1434| 14| 32| 24| 05| 1161628.0| 1912157.0|2015|02/10/2018 03:50:...|41.914635| -87.68163|(41.914635603, -8...| |10224744| HY411605|09/05/2015 01:00:...| 025XX W CERMAK RD|0860| THEFT| RETAIL THEFT| GROCERY FOOD STORE| true| false|1034| 10| 25| 31| 06| 1159734.0| 1889313.0|2015|09/17/2015 11:37:...| 41.85199| -87.68922|(41.851988885, -8...| |10224745| HY411654|09/05/2015 11:30:...|031XX W WASHINGTO...|0320| ROBBERY|STRONGARM - NO WE...| STREET| false| true|1222| 12| 27| 27| 03| 1155536.0| 1900515.0|2015|02/10/2018 03:50:...|41.882812| -87.70432|(41.88281374, -87...| |11645836| JC212333|05/01/2016 12:25:...| 055XX S ROCKWELL ST|1153|DECEPTIVE PRACTICE|FINANCIAL IDENTIT...| NULL| false| false| 824| 8| 15| 63| 11| NULL| NULL|2016|04/06/2019 04:04:...| NULL| NULL| NULL| |10224746| HY411662|09/05/2015 02:00:...| 071XX S PULASKI RD|0820| THEFT| $500 AND UNDER|PARKING LOT/GARAG...| false| false| 833| 8| 13| 65| 06| 1150938.0| 1857056.0|2015|02/10/2018 03:50:...| 41.76365| -87.72234|(41.763647552, -8...| |10224749| HY411626|09/05/2015 11:00:...|052XX N MILWAUKEE...|0460| BATTERY| SIMPLE| SMALL RETAIL STORE| false| false|1623| 16| 45| 11| 08B| 1137969.0| 1934340.0|2015|02/10/2018 03:50:...|41.975967| -87.76801|(41.975968415, -8...| |10224750| HY411632|09/05/2015 03:00:...| 0000X W 103RD ST|2820| OTHER OFFENSE| TELEPHONE THREAT| APARTMENT| false| true| 512| 5| 34| 49| 26| 1177871.0| 1836676.0|2015|02/10/2018 03:50:...|41.707153|-87.624245|(41.707154919, -8...| |10224751| HY411566|09/05/2015 12:50:...| 013XX E 47TH ST|0486| BATTERY|DOMESTIC BATTERY ...| STREET| false| true| 222| 2| 4| 39| 08B| 1185907.0| 1874105.0|2015|02/10/2018 03:50:...|41.809677|-87.593636|(41.809678314, -8...| |10224752| HY411601|09/03/2015 01:00:...| 020XX W SCHILLER ST|0810| THEFT| OVER $500| STREET| false| false|1424| 14| 1| 24| 06| 1162574.0| 1909428.0|2015|02/10/2018 03:50:...|41.907127| -87.67823|(41.907127255, -8...| |10224753| HY411489|09/05/2015 11:45:...| 080XX S JUSTINE ST|0497| BATTERY|AGGRAVATED DOMEST...| APARTMENT| false| false| 612| 6| 21| 71| 04B| 1167400.0| 1851512.0|2015|02/10/2018 03:50:...|41.748096| -87.66216|(41.748097343, -8...| |10224754| HY411656|09/05/2015 01:30:...|007XX N LEAMINGTO...|1320| CRIMINAL DAMAGE| TO VEHICLE| STREET| false| false|1531| 15| 28| 25| 14| 1141889.0| 1904448.0|2015|02/10/2018 03:50:...| 41.89387| -87.75434|(41.893869916, -8...| |10224756| HY410094|07/08/2015 12:00:...|103XX S TORRENCE AVE|0620| BURGLARY| UNLAWFUL ENTRY| OTHER| false| false| 434| 4| 10| 51| 05| 1195508.0| 1836950.0|2015|02/10/2018 03:50:...| 41.70749| -87.55965|(41.707490122, -8...| |10224757| HY411388|09/05/2015 09:55:...| 088XX S PAULINA ST|0610| BURGLARY| FORCIBLE ENTRY| RESIDENCE| true| false|2221| 22| 21| 71| 05| 1166554.0| 1846067.0|2015|02/10/2018 03:50:...|41.733173| -87.66542|(41.733173536, -8...| |10224758| HY411568|09/05/2015 12:35:...| 059XX W GRACE ST|0486| BATTERY|DOMESTIC BATTERY ...| STREET| false| true|1633| 16| 38| 15| 08B| 1136014.0| 1924656.0|2015|02/10/2018 03:50:...| 41.94943| -87.77544|(41.949429769, -8...| +--------+-----------+--------------------+--------------------+----+------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+---------+----------+--------------------+ only showing top 20 rows
import pyspark.sql.functions as sf
crimes_df = (
crimes_df.withColumn('Date', sf.to_timestamp('Date', 'MM/dd/yyyy hh:mm:ss a'))
.withColumn('Updated On', sf.to_timestamp('Updated On', 'MM/dd/yyyy hh:mm:ss a'))
)
crimes_df.printSchema()
root |-- ID: integer (nullable = true) |-- Case Number: string (nullable = true) |-- Date: timestamp (nullable = true) |-- Block: string (nullable = true) |-- IUCR: string (nullable = true) |-- Primary Type: string (nullable = true) |-- Description: string (nullable = true) |-- Location Description: string (nullable = true) |-- Arrest: boolean (nullable = true) |-- Domestic: boolean (nullable = true) |-- Beat: integer (nullable = true) |-- District: integer (nullable = true) |-- Ward: integer (nullable = true) |-- Community Area: integer (nullable = true) |-- FBI Code: string (nullable = true) |-- X Coordinate: float (nullable = true) |-- Y Coordinate: float (nullable = true) |-- Year: integer (nullable = true) |-- Updated On: timestamp (nullable = true) |-- Latitude: float (nullable = true) |-- Longitude: float (nullable = true) |-- Location: string (nullable = true)
crimes_df.show()
+--------+-----------+-------------------+--------------------+----+------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+-------------------+---------+----------+--------------------+ | ID|Case Number| Date| Block|IUCR| Primary Type| Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year| Updated On| Latitude| Longitude| Location| +--------+-----------+-------------------+--------------------+----+------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+-------------------+---------+----------+--------------------+ |10224738| HY411648|2015-09-05 13:30:00| 043XX S WOOD ST|0486| BATTERY|DOMESTIC BATTERY ...| RESIDENCE| false| true| 924| 9| 12| 61| 08B| 1165074.0| 1875917.0|2015|2018-02-10 15:50:01|41.815117| -87.67|(41.815117282, -8...| |10224739| HY411615|2015-09-04 11:30:00| 008XX N CENTRAL AVE|0870| THEFT| POCKET-PICKING| CTA BUS| false| false|1511| 15| 29| 25| 06| 1138875.0| 1904869.0|2015|2018-02-10 15:50:01| 41.89508| -87.7654|(41.895080471, -8...| |11646166| JC213529|2018-09-01 00:01:00|082XX S INGLESIDE...|0810| THEFT| OVER $500| RESIDENCE| false| true| 631| 6| 8| 44| 06| NULL| NULL|2018|2019-04-06 16:04:43| NULL| NULL| NULL| |10224740| HY411595|2015-09-05 12:45:00| 035XX W BARRY AVE|2023| NARCOTICS|POSS: HEROIN(BRN/...| SIDEWALK| true| false|1412| 14| 35| 21| 18| 1152037.0| 1920384.0|2015|2018-02-10 15:50:01|41.937405| -87.71665|(41.937405765, -8...| |10224741| HY411610|2015-09-05 13:00:00| 0000X N LARAMIE AVE|0560| ASSAULT| SIMPLE| APARTMENT| false| true|1522| 15| 28| 25| 08A| 1141706.0| 1900086.0|2015|2018-02-10 15:50:01|41.881905| -87.75512|(41.881903443, -8...| |10224742| HY411435|2015-09-05 10:55:00| 082XX S LOOMIS BLVD|0610| BURGLARY| FORCIBLE ENTRY| RESIDENCE| false| false| 614| 6| 21| 71| 05| 1168430.0| 1850165.0|2015|2018-02-10 15:50:01|41.744377| -87.65843|(41.744378879, -8...| |10224743| HY411629|2015-09-04 18:00:00|021XX W CHURCHILL ST|0620| BURGLARY| UNLAWFUL ENTRY| RESIDENCE-GARAGE| false| false|1434| 14| 32| 24| 05| 1161628.0| 1912157.0|2015|2018-02-10 15:50:01|41.914635| -87.68163|(41.914635603, -8...| |10224744| HY411605|2015-09-05 13:00:00| 025XX W CERMAK RD|0860| THEFT| RETAIL THEFT| GROCERY FOOD STORE| true| false|1034| 10| 25| 31| 06| 1159734.0| 1889313.0|2015|2015-09-17 11:37:18| 41.85199| -87.68922|(41.851988885, -8...| |10224745| HY411654|2015-09-05 11:30:00|031XX W WASHINGTO...|0320| ROBBERY|STRONGARM - NO WE...| STREET| false| true|1222| 12| 27| 27| 03| 1155536.0| 1900515.0|2015|2018-02-10 15:50:01|41.882812| -87.70432|(41.88281374, -87...| |11645836| JC212333|2016-05-01 00:25:00| 055XX S ROCKWELL ST|1153|DECEPTIVE PRACTICE|FINANCIAL IDENTIT...| NULL| false| false| 824| 8| 15| 63| 11| NULL| NULL|2016|2019-04-06 16:04:43| NULL| NULL| NULL| |10224746| HY411662|2015-09-05 14:00:00| 071XX S PULASKI RD|0820| THEFT| $500 AND UNDER|PARKING LOT/GARAG...| false| false| 833| 8| 13| 65| 06| 1150938.0| 1857056.0|2015|2018-02-10 15:50:01| 41.76365| -87.72234|(41.763647552, -8...| |10224749| HY411626|2015-09-05 11:00:00|052XX N MILWAUKEE...|0460| BATTERY| SIMPLE| SMALL RETAIL STORE| false| false|1623| 16| 45| 11| 08B| 1137969.0| 1934340.0|2015|2018-02-10 15:50:01|41.975967| -87.76801|(41.975968415, -8...| |10224750| HY411632|2015-09-05 03:00:00| 0000X W 103RD ST|2820| OTHER OFFENSE| TELEPHONE THREAT| APARTMENT| false| true| 512| 5| 34| 49| 26| 1177871.0| 1836676.0|2015|2018-02-10 15:50:01|41.707153|-87.624245|(41.707154919, -8...| |10224751| HY411566|2015-09-05 12:50:00| 013XX E 47TH ST|0486| BATTERY|DOMESTIC BATTERY ...| STREET| false| true| 222| 2| 4| 39| 08B| 1185907.0| 1874105.0|2015|2018-02-10 15:50:01|41.809677|-87.593636|(41.809678314, -8...| |10224752| HY411601|2015-09-03 13:00:00| 020XX W SCHILLER ST|0810| THEFT| OVER $500| STREET| false| false|1424| 14| 1| 24| 06| 1162574.0| 1909428.0|2015|2018-02-10 15:50:01|41.907127| -87.67823|(41.907127255, -8...| |10224753| HY411489|2015-09-05 11:45:00| 080XX S JUSTINE ST|0497| BATTERY|AGGRAVATED DOMEST...| APARTMENT| false| false| 612| 6| 21| 71| 04B| 1167400.0| 1851512.0|2015|2018-02-10 15:50:01|41.748096| -87.66216|(41.748097343, -8...| |10224754| HY411656|2015-09-05 13:30:00|007XX N LEAMINGTO...|1320| CRIMINAL DAMAGE| TO VEHICLE| STREET| false| false|1531| 15| 28| 25| 14| 1141889.0| 1904448.0|2015|2018-02-10 15:50:01| 41.89387| -87.75434|(41.893869916, -8...| |10224756| HY410094|2015-07-08 00:00:00|103XX S TORRENCE AVE|0620| BURGLARY| UNLAWFUL ENTRY| OTHER| false| false| 434| 4| 10| 51| 05| 1195508.0| 1836950.0|2015|2018-02-10 15:50:01| 41.70749| -87.55965|(41.707490122, -8...| |10224757| HY411388|2015-09-05 09:55:00| 088XX S PAULINA ST|0610| BURGLARY| FORCIBLE ENTRY| RESIDENCE| true| false|2221| 22| 21| 71| 05| 1166554.0| 1846067.0|2015|2018-02-10 15:50:01|41.733173| -87.66542|(41.733173536, -8...| |10224758| HY411568|2015-09-05 12:35:00| 059XX W GRACE ST|0486| BATTERY|DOMESTIC BATTERY ...| STREET| false| true|1633| 16| 38| 15| 08B| 1136014.0| 1924656.0|2015|2018-02-10 15:50:01| 41.94943| -87.77544|(41.949429769, -8...| +--------+-----------+-------------------+--------------------+----+------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+-------------------+---------+----------+--------------------+ only showing top 20 rows
from pyspark.sql import functions as F
crimes_df = crimes_df.select(
F.col("ID").alias("id"),
F.col("Case Number").alias("case_number"),
F.col("Date").alias("date"),
F.col("Block").alias("block"),
F.col("IUCR").alias("iucr"),
F.col("Primary Type").alias("primary_type"),
F.col("Description").alias("description"),
F.col("Location Description").alias("location_description"),
F.col("Arrest").alias("arrest"),
F.col("Domestic").alias("domestic"),
F.col("Beat").alias("beat"),
F.col("District").alias("district"),
F.col("Ward").alias("ward"),
F.col("Community Area").alias("community_area"),
F.col("FBI Code").alias("fbi_code"),
F.col("X Coordinate").alias("x_coordinate"),
F.col("Y Coordinate").alias("y_coordinate"),
F.col("Year").alias("year"),
F.col("Updated On").alias("updated_on"),
F.col("Latitude").alias("latitude"),
F.col("Longitude").alias("longitude"),
F.col("Location").alias("location"))
crimes_df.limit(10).toPandas()
| id | case_number | date | block | iucr | primary_type | description | location_description | arrest | domestic | ... | ward | community_area | fbi_code | x_coordinate | y_coordinate | year | updated_on | latitude | longitude | location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10224738 | HY411648 | 2015-09-05 13:30:00 | 043XX S WOOD ST | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | RESIDENCE | False | True | ... | 12 | 61 | 08B | 1165074.0 | 1875917.0 | 2015 | 2018-02-10 15:50:01 | 41.815117 | -87.669998 | (41.815117282, -87.669999562) |
| 1 | 10224739 | HY411615 | 2015-09-04 11:30:00 | 008XX N CENTRAL AVE | 0870 | THEFT | POCKET-PICKING | CTA BUS | False | False | ... | 29 | 25 | 06 | 1138875.0 | 1904869.0 | 2015 | 2018-02-10 15:50:01 | 41.895081 | -87.765404 | (41.895080471, -87.765400451) |
| 2 | 11646166 | JC213529 | 2018-09-01 00:01:00 | 082XX S INGLESIDE AVE | 0810 | THEFT | OVER $500 | RESIDENCE | False | True | ... | 8 | 44 | 06 | NaN | NaN | 2018 | 2019-04-06 16:04:43 | NaN | NaN | None |
| 3 | 10224740 | HY411595 | 2015-09-05 12:45:00 | 035XX W BARRY AVE | 2023 | NARCOTICS | POSS: HEROIN(BRN/TAN) | SIDEWALK | True | False | ... | 35 | 21 | 18 | 1152037.0 | 1920384.0 | 2015 | 2018-02-10 15:50:01 | 41.937405 | -87.716652 | (41.937405765, -87.716649687) |
| 4 | 10224741 | HY411610 | 2015-09-05 13:00:00 | 0000X N LARAMIE AVE | 0560 | ASSAULT | SIMPLE | APARTMENT | False | True | ... | 28 | 25 | 08A | 1141706.0 | 1900086.0 | 2015 | 2018-02-10 15:50:01 | 41.881905 | -87.755119 | (41.881903443, -87.755121152) |
| 5 | 10224742 | HY411435 | 2015-09-05 10:55:00 | 082XX S LOOMIS BLVD | 0610 | BURGLARY | FORCIBLE ENTRY | RESIDENCE | False | False | ... | 21 | 71 | 05 | 1168430.0 | 1850165.0 | 2015 | 2018-02-10 15:50:01 | 41.744377 | -87.658432 | (41.744378879, -87.658430635) |
| 6 | 10224743 | HY411629 | 2015-09-04 18:00:00 | 021XX W CHURCHILL ST | 0620 | BURGLARY | UNLAWFUL ENTRY | RESIDENCE-GARAGE | False | False | ... | 32 | 24 | 05 | 1161628.0 | 1912157.0 | 2015 | 2018-02-10 15:50:01 | 41.914635 | -87.681633 | (41.914635603, -87.681630909) |
| 7 | 10224744 | HY411605 | 2015-09-05 13:00:00 | 025XX W CERMAK RD | 0860 | THEFT | RETAIL THEFT | GROCERY FOOD STORE | True | False | ... | 25 | 31 | 06 | 1159734.0 | 1889313.0 | 2015 | 2015-09-17 11:37:18 | 41.851990 | -87.689217 | (41.851988885, -87.689219118) |
| 8 | 10224745 | HY411654 | 2015-09-05 11:30:00 | 031XX W WASHINGTON BLVD | 0320 | ROBBERY | STRONGARM - NO WEAPON | STREET | False | True | ... | 27 | 27 | 03 | 1155536.0 | 1900515.0 | 2015 | 2018-02-10 15:50:01 | 41.882812 | -87.704323 | (41.88281374, -87.704325717) |
| 9 | 11645836 | JC212333 | 2016-05-01 00:25:00 | 055XX S ROCKWELL ST | 1153 | DECEPTIVE PRACTICE | FINANCIAL IDENTITY THEFT OVER $ 300 | None | False | False | ... | 15 | 63 | 11 | NaN | NaN | 2016 | 2019-04-06 16:04:43 | NaN | NaN | None |
10 rows × 22 columns
num_rows = crimes_df.count()
num_cols = len(crimes_df.columns)
print(num_rows, num_cols)
[Stage 1010:=> (1 + 31) / 32]
7784664 22
crimes_df_s = crimes_df.sample(fraction=0.1, seed=42)
crimes_df.describe()
DataFrame[summary: string, id: string, case_number: string, block: string, iucr: string, primary_type: string, description: string, location_description: string, beat: string, district: string, ward: string, community_area: string, fbi_code: string, x_coordinate: string, y_coordinate: string, year: string, latitude: string, longitude: string, location: string]
Analisando Valores Nulos
print("Linhas:", crimes_df.count())
print("Colunas:", len(crimes_df.columns))
Linhas: 7784664 Colunas: 22
crimes_df.show(5, truncate=False)
+--------+-----------+-------------------+---------------------+----+------------+-----------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+-------------------+---------+---------+-----------------------------+ |id |case_number|date |block |iucr|primary_type|description |location_description|arrest|domestic|beat|district|ward|community_area|fbi_code|x_coordinate|y_coordinate|year|updated_on |latitude |longitude|location | +--------+-----------+-------------------+---------------------+----+------------+-----------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+-------------------+---------+---------+-----------------------------+ |10224738|HY411648 |2015-09-05 13:30:00|043XX S WOOD ST |0486|BATTERY |DOMESTIC BATTERY SIMPLE|RESIDENCE |false |true |924 |9 |12 |61 |08B |1165074.0 |1875917.0 |2015|2018-02-10 15:50:01|41.815117|-87.67 |(41.815117282, -87.669999562)| |10224739|HY411615 |2015-09-04 11:30:00|008XX N CENTRAL AVE |0870|THEFT |POCKET-PICKING |CTA BUS |false |false |1511|15 |29 |25 |06 |1138875.0 |1904869.0 |2015|2018-02-10 15:50:01|41.89508 |-87.7654 |(41.895080471, -87.765400451)| |11646166|JC213529 |2018-09-01 00:01:00|082XX S INGLESIDE AVE|0810|THEFT |OVER $500 |RESIDENCE |false |true |631 |6 |8 |44 |06 |NULL |NULL |2018|2019-04-06 16:04:43|NULL |NULL |NULL | |10224740|HY411595 |2015-09-05 12:45:00|035XX W BARRY AVE |2023|NARCOTICS |POSS: HEROIN(BRN/TAN) |SIDEWALK |true |false |1412|14 |35 |21 |18 |1152037.0 |1920384.0 |2015|2018-02-10 15:50:01|41.937405|-87.71665|(41.937405765, -87.716649687)| |10224741|HY411610 |2015-09-05 13:00:00|0000X N LARAMIE AVE |0560|ASSAULT |SIMPLE |APARTMENT |false |true |1522|15 |28 |25 |08A |1141706.0 |1900086.0 |2015|2018-02-10 15:50:01|41.881905|-87.75512|(41.881903443, -87.755121152)| +--------+-----------+-------------------+---------------------+----+------------+-----------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+-------------------+---------+---------+-----------------------------+ only showing top 5 rows
from pyspark.sql.functions import col, sum
total = crimes_df.count()
nulos = crimes_df.select([
(sum(col(c).isNull().cast("int")) / total*100).alias(c)
for c in crimes_df.columns
])
nulos_pd = nulos.toPandas()
nulos_pd
| id | case_number | date | block | iucr | primary_type | description | location_description | arrest | domestic | ... | ward | community_area | fbi_code | x_coordinate | y_coordinate | year | updated_on | latitude | longitude | location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.000051 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.133352 | 0.0 | 0.0 | ... | 7.898196 | 7.880571 | 0.0 | 1.115629 | 1.115629 | 0.0 | 0.0 | 1.115629 | 1.115629 | 1.115629 |
1 rows × 22 columns
import matplotlib.pyplot as plt
# Converter Spark -> Pandas
nulos_pd = nulos.toPandas().T.reset_index()
nulos_pd.columns = ["variavel", "percent_nulls"]
plt.figure(figsize=(12, 6))
plt.bar(nulos_pd["variavel"], nulos_pd["percent_nulls"])
plt.xticks(rotation=90)
plt.ylabel("Percentual de valores nulos (%)")
plt.title("Percentual de valores nulos por variável")
plt.tight_layout()
plt.show()
crimes_df2 = crimes_df.na.drop()
print("Linhas pré tratamento:", crimes_df.count())
print("Linhas pós tratamento de nulos", crimes_df2.count())
Linhas pré tratamento: 7784664
[Stage 21:=============================> (17 + 15) / 32]
Linhas pós tratamento de nulos 7084435
from pyspark.sql.functions import approx_count_distinct as acd
card_df2 = crimes_df2.select([
acd(c).alias(f"{c}_approx_unique")
for c in crimes_df2.columns
])
card_pd = card_df2.toPandas().T
card_pd.columns = ["approx_unique"]
card_pd
25/12/06 21:20:04 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
| approx_unique | |
|---|---|
| id_approx_unique | 7010933 |
| case_number_approx_unique | 6892944 |
| date_approx_unique | 2965609 |
| block_approx_unique | 35056 |
| iucr_approx_unique | 386 |
| primary_type_approx_unique | 36 |
| description_approx_unique | 516 |
| location_description_approx_unique | 226 |
| arrest_approx_unique | 2 |
| domestic_approx_unique | 2 |
| beat_approx_unique | 328 |
| district_approx_unique | 25 |
| ward_approx_unique | 49 |
| community_area_approx_unique | 79 |
| fbi_code_approx_unique | 27 |
| x_coordinate_approx_unique | 73738 |
| y_coordinate_approx_unique | 119819 |
| year_approx_unique | 24 |
| updated_on_approx_unique | 4664 |
| latitude_approx_unique | 84442 |
| longitude_approx_unique | 39758 |
| location_approx_unique | 677206 |
crimes_df2.groupBy("primary_type") \
.count() \
.orderBy("count", ascending=False) \
.show(20, truncate=False)
[Stage 27:=============================================> (26 + 6) / 32]
+--------------------------------+-------+ |primary_type |count | +--------------------------------+-------+ |THEFT |1499197| |BATTERY |1299859| |CRIMINAL DAMAGE |811905 | |NARCOTICS |669097 | |ASSAULT |465810 | |OTHER OFFENSE |440288 | |BURGLARY |390418 | |MOTOR VEHICLE THEFT |339630 | |DECEPTIVE PRACTICE |302833 | |ROBBERY |267994 | |CRIMINAL TRESPASS |195986 | |WEAPONS VIOLATION |100385 | |PROSTITUTION |61348 | |OFFENSE INVOLVING CHILDREN |49456 | |PUBLIC PEACE VIOLATION |48705 | |SEX OFFENSE |26311 | |CRIM SEXUAL ASSAULT |24123 | |INTERFERENCE WITH PUBLIC OFFICER|17821 | |GAMBLING |13405 | |LIQUOR LAW VIOLATION |12782 | +--------------------------------+-------+ only showing top 20 rows
import matplotlib.pyplot as plt
# Agrupamento no Spark
tipos_crimes_df = (
crimes_df2.groupBy("primary_type")
.count()
.orderBy("count", ascending=False)
)
# Coletando para Pandas (seguro porque são poucas categorias)
tipos_crimes_pd = tipos_crimes_df.toPandas()
# Criar gráfico horizontal
plt.figure(figsize=(10, 8))
plt.barh(
tipos_crimes_pd["primary_type"],
tipos_crimes_pd["count"]
)
plt.xlabel("Quantidade de crimes")
plt.ylabel("Tipo de crime")
plt.title("Crimes por Tipo")
plt.gca().invert_yaxis() # maior no topo
plt.tight_layout()
plt.show()
desc_type_df = crimes_df2.groupBy("primary_type", "description") \
.count() \
.orderBy("count", ascending=False)
desc_type_pd = desc_type_df.toPandas()
top20 = desc_type_pd.head(20)
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.barh(top20["primary_type"] + " - " + top20["description"], top20["count"])
plt.xlabel("Quantidade de Crimes")
plt.ylabel("Tipo - Descrição")
plt.title("Top 20 Ocorrências por Tipo e Descrição de Crime")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
crimes_df2.groupBy("community_area") \
.count() \
.orderBy("count", ascending=False) \
.show(15)
[Stage 46:==============================> (18 + 14) / 32]
+--------------+------+ |community_area| count| +--------------+------+ | 25|443654| | 8|249001| | 43|234412| | 23|221397| | 28|213872| | 24|207273| | 29|206533| | 67|203194| | 71|201132| | 49|188672| | 68|185313| | 69|176737| | 32|175126| | 66|173057| | 44|156585| +--------------+------+ only showing top 15 rows
crimes_df2.groupBy("district") \
.count() \
.orderBy("count", ascending=False) \
.show(15)
[Stage 49:> (0 + 32) / 32]
+--------+------+ |district| count| +--------+------+ | 8|479088| | 11|457146| | 6|418970| | 7|412732| | 4|406393| | 25|402994| | 3|360823| | 12|348935| | 9|345915| | 2|321434| | 5|316018| | 18|316000| | 19|315365| | 10|307428| | 15|304731| +--------+------+ only showing top 15 rows
crimes_por_ano = crimes_df2.groupBy("year") \
.count() \
.orderBy("year")
pdf = crimes_por_ano.toPandas()
import seaborn as sns
import matplotlib.pyplot as plt
# Tema elegante
sns.set_theme(style="whitegrid")
# Criar figura
plt.figure(figsize=(14, 6))
# Gráfico
ax = sns.lineplot(
data=pdf,
x="year",
y="count",
marker="o",
linewidth=2.2,
markersize=8
)
# Mostrar todos os anos no eixo X
ax.set_xticks(pdf["year"])
# Rotacionar para não sobrepor
plt.xticks(rotation=45)
# Títulos e estética
plt.title("Crimes por Ano", fontsize=18, weight='bold')
plt.xlabel("Ano", fontsize=14)
plt.ylabel("Quantidade de Crimes", fontsize=14)
# Ajusta layout pra não cortar nada
plt.tight_layout()
plt.show()
from pyspark.sql.functions import hour, month, to_timestamp
crimes_df2 = crimes_df2.withColumn(
"date_ts",
to_timestamp("date", "MM/dd/yyyy hh:mm:ss a")
)
crimes_df2 = crimes_df2.withColumn("month", month("date_ts"))
crimes_hora_ano = crimes_df2.groupBy("year", "month") \
.count() \
.orderBy("year", "month")
pdf3 = crimes_hora_ano.toPandas()
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
plt.figure(figsize=(16, 8))
ax = sns.lineplot(
data=pdf3,
x="month",
y="count",
hue="year",
palette="tab20",
marker="o"
)
# Mostrar todas as horas 0–23
ax.set_xticks(sorted(pdf3["month"].unique()))
plt.xticks(rotation=0)
plt.title("Crimes por Mês do Ano (Separado por Ano)", fontsize=18, weight="bold")
plt.xlabel("Mês", fontsize=14)
plt.ylabel("Quantidade de Crimes", fontsize=14)
plt.legend(title="Ano", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()
crimes_tipo_ano = crimes_df2.groupBy("year", "primary_type") \
.count() \
.orderBy("year", "primary_type")
pdf2 = crimes_tipo_ano.toPandas()
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
plt.figure(figsize=(16, 8))
ax = sns.lineplot(
data=pdf2,
x="year",
y="count",
hue="primary_type", # ➜ uma linha por tipo!
estimator=None,
marker="o"
)
# Mostrar todos os anos no eixo X
ax.set_xticks(sorted(pdf2["year"].unique()))
plt.xticks(rotation=45)
plt.title("Crimes por Ano e Tipo de Crime", fontsize=18, weight="bold")
plt.xlabel("Ano", fontsize=14)
plt.ylabel("Quantidade de Crimes", fontsize=14)
plt.legend(title="Tipo de Crime", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()
# tirando os dados de 2001,2002 e 2023
crimes_df2 = crimes_df.filter(
(col("year") != 2001) &
(col("year") != 2002) &
(col("year") != 2023)
)
# tiramos porque os dados nulos estavam concentrados em 2001 e 2002. 2023 estava com dados somente ate abril
from pyspark.sql.functions import hour, to_timestamp
crimes_df2 = crimes_df2.withColumn(
"date_ts",
to_timestamp("date", "MM/dd/yyyy hh:mm:ss a")
)
crimes_df2 = crimes_df2.withColumn("hour", hour("date_ts"))
crimes_hora_ano = crimes_df2.groupBy("year", "hour") \
.count() \
.orderBy("year", "hour")
pdf3 = crimes_hora_ano.toPandas()
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
plt.figure(figsize=(16, 8))
ax = sns.lineplot(
data=pdf3,
x="hour",
y="count",
hue="year",
palette="tab20",
marker="o"
)
# Mostrar todas as horas 0–23
ax.set_xticks(sorted(pdf3["hour"].unique()))
plt.xticks(rotation=0)
plt.title("Crimes por Hora do Dia (Separado por Ano)", fontsize=18, weight="bold")
plt.xlabel("Hora do Dia", fontsize=14)
plt.ylabel("Quantidade de Crimes", fontsize=14)
plt.legend(title="Ano", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()
crimes_df2.limit(10).toPandas().T
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 10224738 | 10224739 | 11646166 | 10224740 | 10224741 | 10224742 | 10224743 | 10224744 | 10224745 | 11645836 |
| case_number | HY411648 | HY411615 | JC213529 | HY411595 | HY411610 | HY411435 | HY411629 | HY411605 | HY411654 | JC212333 |
| date | 2015-09-05 13:30:00 | 2015-09-04 11:30:00 | 2018-09-01 00:01:00 | 2015-09-05 12:45:00 | 2015-09-05 13:00:00 | 2015-09-05 10:55:00 | 2015-09-04 18:00:00 | 2015-09-05 13:00:00 | 2015-09-05 11:30:00 | 2016-05-01 00:25:00 |
| block | 043XX S WOOD ST | 008XX N CENTRAL AVE | 082XX S INGLESIDE AVE | 035XX W BARRY AVE | 0000X N LARAMIE AVE | 082XX S LOOMIS BLVD | 021XX W CHURCHILL ST | 025XX W CERMAK RD | 031XX W WASHINGTON BLVD | 055XX S ROCKWELL ST |
| iucr | 0486 | 0870 | 0810 | 2023 | 0560 | 0610 | 0620 | 0860 | 0320 | 1153 |
| primary_type | BATTERY | THEFT | THEFT | NARCOTICS | ASSAULT | BURGLARY | BURGLARY | THEFT | ROBBERY | DECEPTIVE PRACTICE |
| description | DOMESTIC BATTERY SIMPLE | POCKET-PICKING | OVER $500 | POSS: HEROIN(BRN/TAN) | SIMPLE | FORCIBLE ENTRY | UNLAWFUL ENTRY | RETAIL THEFT | STRONGARM - NO WEAPON | FINANCIAL IDENTITY THEFT OVER $ 300 |
| location_description | RESIDENCE | CTA BUS | RESIDENCE | SIDEWALK | APARTMENT | RESIDENCE | RESIDENCE-GARAGE | GROCERY FOOD STORE | STREET | None |
| arrest | False | False | False | True | False | False | False | True | False | False |
| domestic | True | False | True | False | True | False | False | False | True | False |
| beat | 924 | 1511 | 631 | 1412 | 1522 | 614 | 1434 | 1034 | 1222 | 824 |
| district | 9 | 15 | 6 | 14 | 15 | 6 | 14 | 10 | 12 | 8 |
| ward | 12 | 29 | 8 | 35 | 28 | 21 | 32 | 25 | 27 | 15 |
| community_area | 61 | 25 | 44 | 21 | 25 | 71 | 24 | 31 | 27 | 63 |
| fbi_code | 08B | 06 | 06 | 18 | 08A | 05 | 05 | 06 | 03 | 11 |
| x_coordinate | 1165074.0 | 1138875.0 | NaN | 1152037.0 | 1141706.0 | 1168430.0 | 1161628.0 | 1159734.0 | 1155536.0 | NaN |
| y_coordinate | 1875917.0 | 1904869.0 | NaN | 1920384.0 | 1900086.0 | 1850165.0 | 1912157.0 | 1889313.0 | 1900515.0 | NaN |
| year | 2015 | 2015 | 2018 | 2015 | 2015 | 2015 | 2015 | 2015 | 2015 | 2016 |
| updated_on | 2018-02-10 15:50:01 | 2018-02-10 15:50:01 | 2019-04-06 16:04:43 | 2018-02-10 15:50:01 | 2018-02-10 15:50:01 | 2018-02-10 15:50:01 | 2018-02-10 15:50:01 | 2015-09-17 11:37:18 | 2018-02-10 15:50:01 | 2019-04-06 16:04:43 |
| latitude | 41.815117 | 41.895081 | NaN | 41.937405 | 41.881905 | 41.744377 | 41.914635 | 41.85199 | 41.882812 | NaN |
| longitude | -87.669998 | -87.765404 | NaN | -87.716652 | -87.755119 | -87.658432 | -87.681633 | -87.689217 | -87.704323 | NaN |
| location | (41.815117282, -87.669999562) | (41.895080471, -87.765400451) | None | (41.937405765, -87.716649687) | (41.881903443, -87.755121152) | (41.744378879, -87.658430635) | (41.914635603, -87.681630909) | (41.851988885, -87.689219118) | (41.88281374, -87.704325717) | None |
| date_ts | 2015-09-05 13:30:00 | 2015-09-04 11:30:00 | 2018-09-01 00:01:00 | 2015-09-05 12:45:00 | 2015-09-05 13:00:00 | 2015-09-05 10:55:00 | 2015-09-04 18:00:00 | 2015-09-05 13:00:00 | 2015-09-05 11:30:00 | 2016-05-01 00:25:00 |
| hour | 13 | 11 | 0 | 12 | 13 | 10 | 18 | 13 | 11 | 0 |
from pyspark.sql.functions import hour
crimes_df2 = crimes_df2.withColumn("hour", hour("date"))
crimes_df2.select("date", "hour").show(20, truncate=False)
+-------------------+----+ |date |hour| +-------------------+----+ |2015-09-05 13:30:00|13 | |2015-09-04 11:30:00|11 | |2018-09-01 00:01:00|0 | |2015-09-05 12:45:00|12 | |2015-09-05 13:00:00|13 | |2015-09-05 10:55:00|10 | |2015-09-04 18:00:00|18 | |2015-09-05 13:00:00|13 | |2015-09-05 11:30:00|11 | |2016-05-01 00:25:00|0 | |2015-09-05 14:00:00|14 | |2015-09-05 11:00:00|11 | |2015-09-05 03:00:00|3 | |2015-09-05 12:50:00|12 | |2015-09-03 13:00:00|13 | |2015-09-05 11:45:00|11 | |2015-09-05 13:30:00|13 | |2015-07-08 00:00:00|0 | |2015-09-05 09:55:00|9 | |2015-09-05 12:35:00|12 | +-------------------+----+ only showing top 20 rows
from pyspark.sql.functions import dayofweek
crimes_df2 = crimes_df2.withColumn("day_of_week", dayofweek("date_ts"))
crimes_semana_ano = crimes_df2.groupBy("year", "day_of_week") \
.count() \
.orderBy("year", "day_of_week")
pdf_semana = crimes_semana_ano.toPandas()
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
plt.figure(figsize=(16, 8))
ax = sns.lineplot(
data=pdf_semana,
x="day_of_week",
y="count",
hue="year",
palette="tab20",
marker="o"
)
# Mostrar dias 1–7
ax.set_xticks(sorted(pdf_semana["day_of_week"].unique()))
plt.xticks(rotation=0)
plt.title("Crimes por Dia da Semana (Separado por Ano)", fontsize=18, weight="bold")
plt.xlabel("Dia da Semana (1=Dom, 7=Sáb)", fontsize=14)
plt.ylabel("Quantidade de Crimes", fontsize=14)
plt.legend(title="Ano", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()
from pyspark.sql.functions import col
prisao_abs = crimes_df2.filter(col("arrest") == True) \
.groupBy("primary_type") \
.count() \
.orderBy("count", ascending=False)
from pyspark.sql.functions import col, sum, count
prisao_prop = crimes_df2.groupBy("primary_type") \
.agg(
count("*").alias("total_casos"),
sum(col("arrest").cast("int")).alias("total_prisoes")
) \
.withColumn("proporcao_prisao", col("total_prisoes") / col("total_casos")) \
.orderBy(col("proporcao_prisao").desc())
pdf_prisao = prisao_prop.toPandas()
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
plt.figure(figsize=(14, 8))
# ordenar por proporção
df15 = pdf_prisao.head(15).sort_values("proporcao_prisao")
sns.barplot(
data=df15,
x="proporcao_prisao",
y="primary_type",
palette="rocket",
hue="primary_type",
dodge=False,
legend=False
)
plt.title("Proporção de Prisões por Tipo de Crime", fontsize=18, weight="bold")
plt.xlabel("Proporção de Prisões")
plt.ylabel("Tipo de Crime")
plt.tight_layout()
plt.show()
from pyspark.sql.functions import create_map, lit, col
mapping = {
1: "Rogers Park", 2: "West Ridge", 3: "Uptown", 4: "Lincoln Square",
5: "North Center", 6: "Lake View", 7: "Lincoln Park", 8: "Near North Side",
9: "Edison Park", 10: "Norwood Park", 11: "Jefferson Park", 12: "Forest Glen",
13: "North Park", 14: "Albany Park", 15: "Portage Park", 16: "Irving Park",
17: "Dunning", 18: "Montclair", 19: "Belmont Cragin", 20: "Hermosa",
21: "Avondale", 22: "Logan Square", 23: "Humboldt Park", 24: "West Town",
25: "Austin", 26: "West Garfield Park", 27: "East Garfield Park",
28: "Near West Side", 29: "North Lawndale", 30: "South Lawndale",
31: "Lower West Side", 32: "Loop", 33: "Near South Side", 34: "Armour Square",
35: "Douglas", 36: "Oakland", 37: "Fuller Park", 38: "Grand Boulevard",
39: "Kenwood", 40: "Washington Park", 41: "Hyde Park", 42: "Woodlawn",
43: "South Shore", 44: "Chatham", 45: "Avalon Park", 46: "South Chicago",
47: "Burnside", 48: "Calumet Heights", 49: "Roseland", 50: "Pullman",
51: "South Deering", 52: "East Side", 53: "West Pullman", 54: "Riverdale",
55: "Hegewisch", 56: "Garfield Ridge", 57: "Archer Heights",
58: "Brighton Park", 59: "McKinley Park", 60: "Bridgeport", 61: "New City",
62: "West Elsdon", 63: "Gage Park", 64: "Clearing", 65: "West Lawn",
66: "Chicago Lawn", 67: "West Englewood", 68: "Englewood",
69: "Greater Grand Crossing", 70: "Ashburn", 71: "Auburn Gresham",
72: "Beverly", 73: "Washington Heights", 74: "Mount Greenwood",
75: "Morgan Park", 76: "O'Hare", 77: "Edgewater"
}
# converter dict → lista alternada para create_map
mapping_expr = create_map([lit(x) for pair in mapping.items() for x in pair])
crimes_df2 = crimes_df.withColumn("community_area_name", mapping_expr[col("community_area")])
areas_df = spark.read.csv("Boundaries_-_Community_Areas_20251201.csv", header=True, inferSchema=True)
print("Colunas do CSV:")
print(areas_df.columns)
areas_df.toPandas()
Colunas do CSV: ['the_geom', 'AREA_NUMBE', 'COMMUNITY', 'AREA_NUM_1', 'SHAPE_AREA', 'SHAPE_LEN']
| the_geom | AREA_NUMBE | COMMUNITY | AREA_NUM_1 | SHAPE_AREA | SHAPE_LEN | |
|---|---|---|---|---|---|---|
| 0 | MULTIPOLYGON (((-87.65455590025104 41.99816614... | 1 | ROGERS PARK | 1 | 51,259,902.4506 | 34,052.3975757 |
| 1 | MULTIPOLYGON (((-87.6846530946559 42.019484772... | 2 | WEST RIDGE | 2 | 98,429,094.8621 | 43,020.6894583 |
| 2 | MULTIPOLYGON (((-87.64102430213292 41.95480280... | 3 | UPTOWN | 3 | 65,095,642.7289 | 46,972.7945549 |
| 3 | MULTIPOLYGON (((-87.6744075678037 41.976103404... | 4 | LINCOLN SQUARE | 4 | 71,352,328.2399 | 36,624.6030848 |
| 4 | MULTIPOLYGON (((-87.67336415409336 41.93234274... | 5 | NORTH CENTER | 5 | 57,054,167.85 | 31,391.6697542 |
| ... | ... | ... | ... | ... | ... | ... |
| 72 | MULTIPOLYGON (((-87.63373383514987 41.72885272... | 73 | WASHINGTON HEIGHTS | 73 | 79,635,752.8769 | 42,222.598163 |
| 73 | MULTIPOLYGON (((-87.69645961375822 41.70714491... | 74 | MOUNT GREENWOOD | 74 | 75,584,290.0209 | 48,665.1305392 |
| 74 | MULTIPOLYGON (((-87.64215204651398 41.68508211... | 75 | MORGAN PARK | 75 | 91,877,340.6988 | 46,396.419362 |
| 75 | MULTIPOLYGON (((-87.83658087874365 41.98639611... | 76 | OHARE | 76 | 371,835,607.687 | 173,625.98466 |
| 76 | MULTIPOLYGON (((-87.65455590025104 41.99816614... | 77 | EDGEWATER | 77 | 48,449,990.8397 | 31,004.8309456 |
77 rows × 6 columns
areas_clean = areas_df.select(
col("AREA_NUM_1").cast("int").alias("community_area"),
col("COMMUNITY").alias("community_area_name"),
col("the_geom"),
col("SHAPE_AREA").alias("shape_area"),
col("SHAPE_LEN").alias("shape_len"))
c = crimes_df2.alias("c")
a = areas_clean.alias("a")
crimes_joined = c.join(a, on="community_area", how="left")
taxa_prisao_area = crimes_joined.groupBy(col("c.community_area"), col("a.community_area_name"),col("the_geom"),) \
.agg(
count("*").alias("total_casos"),
sum(col("c.arrest").cast("int")).alias("total_prisoes")
) \
.withColumn("taxa_prisao", col("total_prisoes") / col("total_casos")) \
.orderBy(col("taxa_prisao").desc())
taxa_prisao_area.toPandas()
| community_area | community_area_name | the_geom | total_casos | total_prisoes | taxa_prisao | |
|---|---|---|---|---|---|---|
| 0 | 26.0 | WEST GARFIELD PARK | MULTIPOLYGON (((-87.72023936013656 41.86986908... | 135522 | 58957 | 0.435036 |
| 1 | 27.0 | EAST GARFIELD PARK | MULTIPOLYGON (((-87.69157000948773 41.88819563... | 134276 | 51471 | 0.383322 |
| 2 | 25.0 | AUSTIN | MULTIPOLYGON (((-87.78941511405804 41.91751009... | 448276 | 168183 | 0.375177 |
| 3 | 23.0 | HUMBOLDT PARK | MULTIPOLYGON (((-87.69157000948773 41.88819563... | 223982 | 84005 | 0.375052 |
| 4 | 29.0 | NORTH LAWNDALE | MULTIPOLYGON (((-87.72023936013656 41.86986908... | 209901 | 77906 | 0.371156 |
| ... | ... | ... | ... | ... | ... | ... |
| 74 | 7.0 | LINCOLN PARK | MULTIPOLYGON (((-87.63181810269614 41.93258180... | 111392 | 14793 | 0.132801 |
| 75 | 72.0 | BEVERLY | MULTIPOLYGON (((-87.67308255594219 41.73565672... | 26039 | 3349 | 0.128615 |
| 76 | 9.0 | EDISON PARK | MULTIPOLYGON (((-87.80675853375328 42.00083736... | 7128 | 806 | 0.113075 |
| 77 | 0.0 | None | None | 76 | 8 | 0.105263 |
| 78 | 12.0 | FOREST GLEN | MULTIPOLYGON (((-87.76918527760162 42.00488913... | 13342 | 1392 | 0.104332 |
79 rows × 6 columns
from shapely import wkt
import geopandas as gpd
import matplotlib.pyplot as plt
taxa_prisao_pd = taxa_prisao_area.toPandas()
taxa_prisao_pd = taxa_prisao_pd.dropna(subset=["the_geom"])
taxa_prisao_pd["geometry"] = taxa_prisao_pd["the_geom"].apply(wkt.loads)
gdf = gpd.GeoDataFrame(
taxa_prisao_pd,
geometry="geometry",
crs="EPSG:4326"
)
gdf = gdf.to_crs(epsg=3857)
fig, ax = plt.subplots(figsize=(8, 8))
gdf.plot(
column="taxa_prisao",
ax=ax,
edgecolor="black",
linewidth=0.6,
legend=True
)
ax.set_title("Taxa de prisão por Community Area em Chicago")
ax.set_axis_off()
plt.savefig("chicago_taxa_prisao_map.png", dpi=150, bbox_inches="tight")
plt.show()
from pyspark.sql import functions as F
crimes_prisao = crimes_df2.groupBy("arrest") \
.count() \
.withColumn("percent", F.col("count") / crimes_df.count() * 100) \
.withColumn("percent", F.concat(F.format_number("percent", 2), F.lit("%")))
crimes_prisao.show()
# a variável alvo está desbalanceada
[Stage 146:=====> (3 + 29) / 32]
+------+-------+-------+ |arrest| count|percent| +------+-------+-------+ | true|2034764| 26.14%| | false|5749900| 73.86%| +------+-------+-------+
crimes_df2 = crimes_df2.withColumn("hour", hour(col("date")))
from pyspark.sql.functions import col, when
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
df_model = crimes_df2.select(
col("arrest").cast("int").alias("label"),
"arrest",
col("domestic").cast("string").alias("domestic"),
"primary_type",
"location_description",
col("hour").cast("int").alias("hour"),
col("year").cast("int").alias("year"),
col("community_area").cast("int").alias("community_area"),
col("latitude").cast("double").alias("latitude"),
col("longitude").cast("double").alias("longitude"),
col("beat").cast("int").alias("beat"),
col("district").cast("int").alias("district"),
col("ward").cast("int").alias("ward")
)
# 0) criar coluna 'hour' a partir da coluna de data
from pyspark.sql.functions import hour, col, when
crimes_df2 = crimes_df2.withColumn("hour", hour(col("date")))
# ==========================================
# 1. SELECIONAR COLUNAS E AJUSTAR TIPOS
# ==========================================
df_model = crimes_df2.select(
# alvo: arrest -> vira label numérico 0/1
col("arrest").cast("int").alias("label"),
"arrest",
# domestic como string para o StringIndexer
col("domestic").cast("string").alias("domestic"),
# categóricas
"primary_type",
"location_description",
# numéricas (tipos garantidos)
col("hour").cast("int").alias("hour"),
col("year").cast("int").alias("year"),
col("community_area").cast("int").alias("community_area"),
col("latitude").cast("double").alias("latitude"),
col("longitude").cast("double").alias("longitude"),
col("beat").cast("int").alias("beat"),
col("district").cast("int").alias("district"),
col("ward").cast("int").alias("ward")
)
df_model = df_model.dropna(subset=[
"label", "domestic", "primary_type", "location_description",
"hour", "year", "community_area", "latitude", "longitude",
"beat", "district", "ward"
])
print("Linhas no df_model (após dropna):", df_model.count())
df_model.printSchema()
[Stage 155:=====================================================> (31 + 1) / 32]
Linhas no df_model (após dropna): 7084438 root |-- label: integer (nullable = true) |-- arrest: boolean (nullable = true) |-- domestic: string (nullable = true) |-- primary_type: string (nullable = true) |-- location_description: string (nullable = true) |-- hour: integer (nullable = true) |-- year: integer (nullable = true) |-- community_area: integer (nullable = true) |-- latitude: double (nullable = true) |-- longitude: double (nullable = true) |-- beat: integer (nullable = true) |-- district: integer (nullable = true) |-- ward: integer (nullable = true)
from pyspark.sql.functions import col
# ==========================================
# 2. CRIAR AMOSTRA MENOR (df_small)
# ==========================================
# pega ~5% da base para não estourar memória na modelagem
df_small = df_model.sample(withReplacement=False, fraction=0.05, seed=42)
print("Linhas no df_model completo :", df_model.count())
print("Linhas na amostra (df_small):", df_small.count())
# ver o desbalanceamento da variável alvo na amostra
df_small.groupBy("label").count().show()
Linhas no df_model completo : 7084438
Linhas na amostra (df_small): 353791
[Stage 164:============================================> (26 + 6) / 32]
+-----+------+ |label| count| +-----+------+ | 1| 91896| | 0|261895| +-----+------+
from pyspark.sql.functions import when
# ==========================================
# 3. BALANCEAR A CLASSE COM WEIGHT + SPLIT
# ==========================================
# contar classe majoritária e minoritária na amostra
major = df_small.filter(col("label") == 0).count()
minor = df_small.filter(col("label") == 1).count()
print("Classe 0 (sem prisão):", major)
print("Classe 1 (com prisão):", minor)
ratio = major / minor
print("Peso aplicado à classe 1 (prisão):", ratio)
# criar coluna de peso
df_small = df_small.withColumn(
"weight",
when(col("label") == 1, ratio).otherwise(1.0)
)
# split treino / teste
train, test = df_small.randomSplit([0.8, 0.2], seed=42)
print("Linhas treino:", train.count())
print("Linhas teste :", test.count())
Classe 0 (sem prisão): 261895 Classe 1 (com prisão): 91896 Peso aplicado à classe 1 (prisão): 2.8499064159484635
Linhas treino: 282923
[Stage 176:=====================================================> (31 + 1) / 32]
Linhas teste : 70868
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
# ==========================================
# 4. PRÉ-PROCESSAMENTO + MODELO (PIPELINE)
# ==========================================
# colunas categóricas que vamos codificar
categorical_cols = [
"primary_type",
"location_description",
"domestic",
"community_area"
]
# StringIndexer: texto -> índice numérico
indexers = [
StringIndexer(
inputCol=c,
outputCol=f"{c}_idx",
handleInvalid="keep"
)
for c in categorical_cols
]
# OneHotEncoder: índice -> vetor binário
encoders = [
OneHotEncoder(
inputCols=[f"{c}_idx"],
outputCols=[f"{c}_ohe"]
)
for c in categorical_cols
]
# colunas numéricas + OHE que vão virar o vetor de features
feature_cols = [
"primary_type_ohe",
"location_description_ohe",
"domestic_ohe",
"community_area_ohe",
"hour",
"year",
"latitude",
"longitude",
"beat",
"district",
"ward"
]
assembler = VectorAssembler(
inputCols=feature_cols,
outputCol="features"
)
25/12/06 19:57:08 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS [Stage 237:> (0 + 1) / 1]
+-----+----------------------------------------+----------+ |label|probability |prediction| +-----+----------------------------------------+----------+ |0 |[0.7306128724815133,0.2693871275184867] |0.0 | |0 |[0.8482676816866546,0.15173231831334544]|0.0 | |0 |[0.854632023435212,0.14536797656478795] |0.0 | |0 |[0.7668307622814486,0.23316923771855136]|0.0 | |0 |[0.7543679356585427,0.24563206434145735]|0.0 | |0 |[0.4845632496315539,0.515436750368446] |1.0 | |0 |[0.5956006353741462,0.4043993646258538] |0.0 | |0 |[0.5459528236746588,0.4540471763253412] |0.0 | |0 |[0.6910459715206198,0.30895402847938025]|0.0 | |0 |[0.683761820797112,0.316238179202888] |0.0 | +-----+----------------------------------------+----------+ only showing top 10 rows
# modelo de regressão logística usando o peso
lr = LogisticRegression(
labelCol="label",
featuresCol="features",
weightCol="weight",
maxIter=20
)
# pipeline completa: indexers -> encoders -> assembler -> modelo
pipeline_lr = Pipeline(stages=indexers + encoders + [assembler, lr])
# treinar o modelo
lr_model = pipeline_lr.fit(train)
# gerar previsões no conjunto de teste
pred_lr = lr_model.transform(test)
pred_lr.select("label", "probability", "prediction").show(10, truncate=False)
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.sql.functions import countDistinct
# ==========================================
# 5. AVALIAÇÃO DO MODELO (AUC, F1, ACCURÁCIA)
# ==========================================
# ver se temos as duas classes no teste
pred_lr.groupBy("label").count().show()
num_classes_test = pred_lr.select(countDistinct("label")).collect()[0][0]
print("Número de classes no conjunto de teste:", num_classes_test)
# --- AUC (curva ROC) ---
# usando a coluna 'probability'
evaluator_auc = BinaryClassificationEvaluator(
labelCol="label",
rawPredictionCol="probability", # usa a probabilidade da classe 1
metricName="areaUnderROC"
)
if num_classes_test > 1:
auc_lr = evaluator_auc.evaluate(pred_lr)
print("AUC (Logistic Regression):", auc_lr)
else:
print("AUC não pôde ser calculado: só há uma classe no conjunto de teste.")
# --- F1-score ---
evaluator_f1 = MulticlassClassificationEvaluator(
labelCol="label",
predictionCol="prediction",
metricName="f1"
)
f1_lr = evaluator_f1.evaluate(pred_lr)
print("F1 (Logistic Regression):", f1_lr)
# --- Acurácia ---
evaluator_acc = MulticlassClassificationEvaluator(
labelCol="label",
predictionCol="prediction",
metricName="accuracy"
)
acc_lr = evaluator_acc.evaluate(pred_lr)
print("Accuracy (Logistic Regression):", acc_lr)
+-----+-----+ |label|count| +-----+-----+ | 1|18430| | 0|52438| +-----+-----+
Número de classes no conjunto de teste: 2
AUC (Logistic Regression): 0.8762476988301116
F1 (Logistic Regression): 0.8358207838802699
[Stage 260:================================================> (28 + 4) / 32]
Accuracy (Logistic Regression): 0.835906191793193
from pyspark.sql.functions import col
# MATRIZ DE CONFUSÃO (tabela simples)
confusion = (
pred_lr
.groupBy("label", "prediction")
.count()
.orderBy("label", "prediction")
)
confusion.show()
[Stage 262:=========================================> (24 + 8) / 32]
+-----+----------+-----+ |label|prediction|count| +-----+----------+-----+ | 0| 0.0|46653| | 0| 1.0| 5785| | 1| 0.0| 5844| | 1| 1.0|12586| +-----+----------+-----+
from pyspark.sql.functions import sum as Fsum, when
# Criar colunas booleanas para cada combinação
metrics_df = pred_lr.select(
when( (col("label") == 1) & (col("prediction") == 1), 1).otherwise(0).alias("TP"),
when( (col("label") == 0) & (col("prediction") == 1), 1).otherwise(0).alias("FP"),
when( (col("label") == 1) & (col("prediction") == 0), 1).otherwise(0).alias("FN"),
when( (col("label") == 0) & (col("prediction") == 0), 1).otherwise(0).alias("TN")
)
agg = metrics_df.agg(
Fsum("TP").alias("TP"),
Fsum("FP").alias("FP"),
Fsum("FN").alias("FN"),
Fsum("TN").alias("TN")
).collect()[0]
TP = agg["TP"]
FP = agg["FP"]
FN = agg["FN"]
TN = agg["TN"]
print("TP:", TP, "FP:", FP, "FN:", FN, "TN:", TN)
# precisão, recall, F1 para classe 1 (prisão)
precision_1 = TP / (TP + FP) if (TP + FP) > 0 else 0
recall_1 = TP / (TP + FN) if (TP + FN) > 0 else 0
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1) if (precision_1 + recall_1) > 0 else 0
print(f"Classe 1 (prisão) - Precision: {precision_1:.3f}, Recall: {recall_1:.3f}, F1: {f1_1:.3f}")
[Stage 265:=========================> (15 + 17) / 32]
TP: 12586 FP: 5785 FN: 5844 TN: 46653 Classe 1 (prisão) - Precision: 0.685, Recall: 0.683, F1: 0.684
# Classe 0
precision_0 = TN / (TN + FN) if (TN + FN) > 0 else 0
recall_0 = TN / (TN + FP) if (TN + FP) > 0 else 0
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0) if (precision_0 + recall_0) > 0 else 0
print(f"Classe 0 (sem prisão) - Precision: {precision_0:.3f}, Recall: {recall_0:.3f}, F1: {f1_0:.3f}")
Classe 0 (sem prisão) - Precision: 0.889, Recall: 0.890, F1: 0.889
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator_f1 = MulticlassClassificationEvaluator(
labelCol="label",
predictionCol="prediction",
metricName="f1"
)
macro_f1 = evaluator_f1.evaluate(pred_lr)
print("Macro F1 (média entre classes):", macro_f1)
[Stage 268:> (0 + 32) / 32]
Macro F1 (média entre classes): 0.8358207838802699
from pyspark.ml.classification import LogisticRegressionModel
# pegar o último estágio da pipeline (deve ser o modelo treinado)
last_stage = lr_model.stages[-1]
print("Tipo do último estágio:", type(last_stage))
# garantir que é um LogisticRegressionModel
if isinstance(last_stage, LogisticRegressionModel):
lr_stage = last_stage
training_summary = lr_stage.summary
# DataFrame com pontos da curva ROC
roc_df = training_summary.roc
roc_df.show(5)
print("AUC (via summary):", training_summary.areaUnderROC)
else:
print("Último estágio não é LogisticRegressionModel, é:", type(last_stage))
Tipo do último estágio: <class 'pyspark.ml.classification.LogisticRegressionModel'>
+--------------------+--------------------+ | FPR| TPR| +--------------------+--------------------+ | 0.0| 0.0| |9.548499214635939E-6|0.003865733808836...| |2.864549764390782E-5|0.008139819780578763| |2.864549764390782E-5|0.012060000544469537| |3.341974725122579E-5|0.016238804344866995| +--------------------+--------------------+ only showing top 5 rows AUC (via summary): 0.8770441487678992
roc_pd = roc_df.toPandas()
import matplotlib.pyplot as plt
plt.figure(figsize=(6, 6))
plt.plot(roc_pd["FPR"], roc_pd["TPR"])
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Curva ROC - Regressão Logística")
plt.grid(True)
plt.tight_layout()
plt.show()
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
# Modelo de Floresta Aleatória
rf = RandomForestClassifier(
labelCol="label",
featuresCol="features",
weightCol="weight", # usa os mesmos pesos da logística
numTrees=50, # pode reduzir p/ 50 se der memória
maxDepth=10, # controla complexidade
featureSubsetStrategy="auto",
seed=42
)
# Pipeline: indexers -> encoders -> assembler -> Random Forest
pipeline_rf = Pipeline(stages=indexers + encoders + [assembler, rf])
# Treinar a Random Forest
rf_model = pipeline_rf.fit(train)
# Predições no conjunto de teste
pred_rf = rf_model.transform(test)
pred_rf.select("label", "probability", "prediction").show(10, truncate=False)
25/12/06 19:57:49 WARN MemoryStore: Not enough space to cache rdd_578_27 in memory! (computed 8.0 MiB so far) 25/12/06 19:57:49 WARN MemoryStore: Not enough space to cache rdd_578_2 in memory! (computed 12.4 MiB so far) 25/12/06 19:57:49 WARN MemoryStore: Not enough space to cache rdd_578_1 in memory! (computed 12.4 MiB so far) 25/12/06 19:57:49 WARN MemoryStore: Not enough space to cache rdd_578_28 in memory! (computed 12.4 MiB so far) 25/12/06 19:57:49 WARN BlockManager: Persisting block rdd_578_27 to disk instead. 25/12/06 19:57:49 WARN BlockManager: Persisting block rdd_578_1 to disk instead. 25/12/06 19:57:49 WARN BlockManager: Persisting block rdd_578_2 to disk instead. 25/12/06 19:57:49 WARN BlockManager: Persisting block rdd_578_28 to disk instead. 25/12/06 19:57:54 WARN DAGScheduler: Broadcasting large task binary with size 1269.3 KiB 25/12/06 19:57:55 WARN DAGScheduler: Broadcasting large task binary with size 1791.6 KiB 25/12/06 19:57:56 WARN DAGScheduler: Broadcasting large task binary with size 2.4 MiB 25/12/06 19:57:58 WARN DAGScheduler: Broadcasting large task binary with size 1331.9 KiB [Stage 277:> (0 + 1) / 1]
+-----+----------------------------------------+----------+ |label|probability |prediction| +-----+----------------------------------------+----------+ |0 |[0.5082534895697707,0.4917465104302293] |0.0 | |0 |[0.6237785832628645,0.37622141673713544]|0.0 | |0 |[0.6170512561154904,0.38294874388450956]|0.0 | |0 |[0.4386105172615851,0.5613894827384149] |1.0 | |0 |[0.5267942750429343,0.4732057249570656] |0.0 | |0 |[0.5039587905528343,0.4960412094471657] |0.0 | |0 |[0.5642040166295567,0.4357959833704433] |0.0 | |0 |[0.5539469405294192,0.44605305947058077]|0.0 | |0 |[0.563183374732184,0.43681662526781595] |0.0 | |0 |[0.5579666344659647,0.4420333655340352] |0.0 | +-----+----------------------------------------+----------+ only showing top 10 rows
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
evaluator_auc = BinaryClassificationEvaluator(
labelCol="label",
rawPredictionCol="rawPrediction",
metricName="areaUnderROC"
)
evaluator_f1 = MulticlassClassificationEvaluator(
labelCol="label",
predictionCol="prediction",
metricName="f1"
)
evaluator_acc = MulticlassClassificationEvaluator(
labelCol="label",
predictionCol="prediction",
metricName="accuracy"
)
auc_rf = evaluator_auc.evaluate(pred_rf)
f1_rf = evaluator_f1.evaluate(pred_rf)
acc_rf = evaluator_acc.evaluate(pred_rf)
print("=== RANDOM FOREST ===")
print(f"AUC: {auc_rf:.4f}")
print(f"F1-score: {f1_rf:.4f}")
print(f"Accuracy: {acc_rf:.4f}")
25/12/06 19:58:32 WARN DAGScheduler: Broadcasting large task binary with size 1328.9 KiB 25/12/06 19:58:36 WARN DAGScheduler: Broadcasting large task binary with size 1340.6 KiB 25/12/06 19:58:40 WARN DAGScheduler: Broadcasting large task binary with size 1340.6 KiB [Stage 291:===================================================> (30 + 2) / 32]
=== RANDOM FOREST === AUC: 0.8657 F1-score: 0.8389 Accuracy: 0.8425
## vai dar erro, ver de tirar
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.sql.functions import col
# ==========================================
# 1. Definir o modelo base de Random Forest
# (reaproveitando indexers, encoders, assembler)
# ==========================================
rf = RandomForestClassifier(
labelCol="label",
featuresCol="features",
weightCol="weight", # se der muito pesado, você pode remover este parâmetro
seed=42
)
pipeline_rf = Pipeline(stages=indexers + encoders + [assembler, rf])
# ==========================================
# 2. Definir grade de hiperparâmetros (pequena!)
# ==========================================
paramGrid_rf = (
ParamGridBuilder()
.addGrid(rf.numTrees, [50, 150]) # número de árvores
.addGrid(rf.maxDepth, [8, 12]) # profundidade máxima
.addGrid(rf.featureSubsetStrategy, ["sqrt", "log2"]) # nº de features por split
.build()
)
# ==========================================
# 3. Definir avaliador (otimizando AUC)
# ==========================================
evaluator_auc = BinaryClassificationEvaluator(
labelCol="label",
rawPredictionCol="rawPrediction",
metricName="areaUnderROC"
)
# ==========================================
# 4. TrainValidationSplit (mais leve que cross-validation)
# ==========================================
tvs_rf = TrainValidationSplit(
estimator=pipeline_rf,
estimatorParamMaps=paramGrid_rf,
evaluator=evaluator_auc,
trainRatio=0.8, # 80% treino / 20% validação dentro do conjunto de treino
parallelism=2 # se o ambiente suportar
)
# ==========================================
# 5. Treinar com tuning na amostra (df_small)
# (garantir que train/test vieram de df_small)
# ==========================================
# Se ainda não tiver feito o split usando df_small:
# train, test = df_small.randomSplit([0.8, 0.2], seed=42)
print("Treinando Random Forest com tuning...")
tvs_model_rf = tvs_rf.fit(train)
# Melhor modelo dentro do grid
best_model_rf = tvs_model_rf.bestModel
best_rf_stage = best_model_rf.stages[-1] # último estágio é o RandomForestModel
print("Melhores hiperparâmetros encontrados:")
print(" numTrees:", best_rf_stage.getNumTrees)
print(" maxDepth:", best_rf_stage.getOrDefault("maxDepth"))
print(" featureSubsetStrategy:", best_rf_stage.getOrDefault("featureSubsetStrategy"))
# ==========================================
# 6. Avaliar o melhor modelo no conjunto de teste
# ==========================================
pred_rf_best = best_model_rf.transform(test)
# AUC
auc_rf = evaluator_auc.evaluate(pred_rf_best)
print("\n=== RANDOM FOREST (TUNADO) ===")
print(f"AUC: {auc_rf:.4f}")
# F1 e Accuracy
evaluator_f1 = MulticlassClassificationEvaluator(
labelCol="label",
predictionCol="prediction",
metricName="f1"
)
evaluator_acc = MulticlassClassificationEvaluator(
labelCol="label",
predictionCol="prediction",
metricName="accuracy"
)
f1_rf = evaluator_f1.evaluate(pred_rf_best)
acc_rf = evaluator_acc.evaluate(pred_rf_best)
print(f"F1-score: {f1_rf:.4f}")
print(f"Accuracy: {acc_rf:.4f}")
# (opcional) ver distribuição de acertos/erros
pred_rf_best.groupBy("label", "prediction").count().show()
Treinando Random Forest com tuning...
25/12/06 20:14:11 WARN DAGScheduler: Broadcasting large task binary with size 1206.8 KiB
25/12/06 20:14:50 WARN DAGScheduler: Broadcasting large task binary with size 1206.8 KiB
25/12/06 20:14:52 WARN DAGScheduler: Broadcasting large task binary with size 1085.7 KiB
25/12/06 20:14:53 WARN DAGScheduler: Broadcasting large task binary with size 1670.3 KiB
25/12/06 20:14:53 WARN DAGScheduler: Broadcasting large task binary with size 1375.6 KiB
25/12/06 20:14:56 WARN DAGScheduler: Broadcasting large task binary with size 2.2 MiB
25/12/06 20:14:56 WARN DAGScheduler: Broadcasting large task binary with size 1717.8 KiB
25/12/06 20:14:59 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
25/12/06 20:14:59 WARN DAGScheduler: Broadcasting large task binary with size 3.0 MiB
25/12/06 20:15:03 WARN DAGScheduler: Broadcasting large task binary with size 3.9 MiB
25/12/06 20:15:03 WARN DAGScheduler: Broadcasting large task binary with size 1143.3 KiB
25/12/06 20:15:06 WARN DAGScheduler: Broadcasting large task binary with size 2042.8 KiB
25/12/06 20:15:09 WARN MemoryStore: Not enough space to cache rdd_1305_3 in memory! (computed 12.0 MiB so far)
25/12/06 20:15:09 WARN MemoryStore: Not enough space to cache rdd_1305_26 in memory! (computed 12.0 MiB so far)
25/12/06 20:15:09 WARN BlockManager: Persisting block rdd_1305_26 to disk instead.
25/12/06 20:15:09 WARN BlockManager: Persisting block rdd_1305_3 to disk instead.
25/12/06 20:15:50 WARN DAGScheduler: Broadcasting large task binary with size 1430.3 KiB
25/12/06 20:15:50 WARN DAGScheduler: Broadcasting large task binary with size 1144.7 KiB
25/12/06 20:15:54 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
25/12/06 20:15:57 WARN DAGScheduler: Broadcasting large task binary with size 1626.0 KiB
25/12/06 20:15:59 WARN DAGScheduler: Broadcasting large task binary with size 3.2 MiB
25/12/06 20:16:00 WARN DAGScheduler: Broadcasting large task binary with size 2.2 MiB
25/12/06 20:16:06 WARN DAGScheduler: Broadcasting large task binary with size 1954.0 KiB
25/12/06 20:16:07 WARN DAGScheduler: Broadcasting large task binary with size 1457.9 KiB
25/12/06 20:16:11 WARN MemoryStore: Not enough space to cache rdd_1589_31 in memory! (computed 12.0 MiB so far)
25/12/06 20:16:11 WARN BlockManager: Persisting block rdd_1589_31 to disk instead.
25/12/06 20:16:53 WARN DAGScheduler: Broadcasting large task binary with size 1430.3 KiB
25/12/06 20:16:53 WARN DAGScheduler: Broadcasting large task binary with size 1144.7 KiB
25/12/06 20:16:54 WARN MemoryStore: Not enough space to cache rdd_1594_1 in memory! (computed 12.0 MiB so far)
25/12/06 20:16:54 WARN MemoryStore: Not enough space to cache rdd_1594_2 in memory! (computed 8.0 MiB so far)
25/12/06 20:16:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
25/12/06 20:16:59 WARN DAGScheduler: Broadcasting large task binary with size 1626.0 KiB
25/12/06 20:17:01 WARN DAGScheduler: Broadcasting large task binary with size 3.2 MiB
25/12/06 20:17:01 WARN DAGScheduler: Broadcasting large task binary with size 2.2 MiB
25/12/06 20:17:04 WARN MemoryStore: Not enough space to cache rdd_1594_1 in memory! (computed 12.0 MiB so far)
25/12/06 20:17:08 WARN DAGScheduler: Broadcasting large task binary with size 4.5 MiB
25/12/06 20:17:17 ERROR Executor: Exception in task 1.0 in stage 659.0 (TID 11116)
java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 ERROR Executor: Exception in task 12.0 in stage 659.0 (TID 11127)
java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 ERROR Executor: Exception in task 27.0 in stage 659.0 (TID 11142)
java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 ERROR Executor: Exception in task 7.0 in stage 659.0 (TID 11122)
java.lang.OutOfMemoryError: Java heap space
at java.base/java.lang.Integer.valueOf(Integer.java:1059)
at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67)
at scala.collection.mutable.ArrayOps$ofInt.apply(ArrayOps.scala:246)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:246)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.mutable.ArrayOps$ofInt.map(ArrayOps.scala:246)
at org.apache.spark.ml.tree.impl.DTStatsAggregator.<init>(DTStatsAggregator.scala:54)
at org.apache.spark.ml.tree.impl.RandomForest$.$anonfun$findBestSplits$22(RandomForest.scala:651)
at org.apache.spark.ml.tree.impl.RandomForest$.$anonfun$findBestSplits$22$adapted(RandomForest.scala:647)
at org.apache.spark.ml.tree.impl.RandomForest$$$Lambda$6148/0x0000000841998040.apply(Unknown Source)
at scala.Array$.tabulate(Array.scala:418)
at org.apache.spark.ml.tree.impl.RandomForest$.$anonfun$findBestSplits$21(RandomForest.scala:647)
at org.apache.spark.ml.tree.impl.RandomForest$$$Lambda$6138/0x000000084195e840.apply(Unknown Source)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:858)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:858)
at org.apache.spark.rdd.RDD$$Lambda$4717/0x0000000841702c40.apply(Unknown Source)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:621)
at org.apache.spark.executor.Executor$TaskRunner$$Lambda$2509/0x0000000840fde040.apply(Unknown Source)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
25/12/06 20:17:17 ERROR Executor: Exception in task 26.0 in stage 659.0 (TID 11141)
java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 ERROR Executor: Exception in task 31.0 in stage 659.0 (TID 11146)
java.lang.OutOfMemoryError: Java heap space
at java.base/java.lang.reflect.Array.newInstance(Array.java:78)
at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:2098)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1675)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2496)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2390)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2228)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1687)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:489)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:447)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87)
at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:302)
at org.apache.spark.storage.BlockManager.maybeCacheDiskValuesInMemory(BlockManager.scala:1735)
at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:961)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:1276)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1377)
at org.apache.spark.storage.BlockManager.getOrElseUpdateRDDBlock(BlockManager.scala:1343)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:379)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:621)
at org.apache.spark.executor.Executor$TaskRunner$$Lambda$2509/0x0000000840fde040.apply(Unknown Source)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
25/12/06 20:17:17 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 27.0 in stage 659.0 (TID 11142),5,main]
java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 1.0 in stage 659.0 (TID 11116),5,main]
java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 7.0 in stage 659.0 (TID 11122),5,main]
java.lang.OutOfMemoryError: Java heap space
at java.base/java.lang.Integer.valueOf(Integer.java:1059)
at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67)
at scala.collection.mutable.ArrayOps$ofInt.apply(ArrayOps.scala:246)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:246)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.mutable.ArrayOps$ofInt.map(ArrayOps.scala:246)
at org.apache.spark.ml.tree.impl.DTStatsAggregator.<init>(DTStatsAggregator.scala:54)
at org.apache.spark.ml.tree.impl.RandomForest$.$anonfun$findBestSplits$22(RandomForest.scala:651)
at org.apache.spark.ml.tree.impl.RandomForest$.$anonfun$findBestSplits$22$adapted(RandomForest.scala:647)
at org.apache.spark.ml.tree.impl.RandomForest$$$Lambda$6148/0x0000000841998040.apply(Unknown Source)
at scala.Array$.tabulate(Array.scala:418)
at org.apache.spark.ml.tree.impl.RandomForest$.$anonfun$findBestSplits$21(RandomForest.scala:647)
at org.apache.spark.ml.tree.impl.RandomForest$$$Lambda$6138/0x000000084195e840.apply(Unknown Source)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:858)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:858)
at org.apache.spark.rdd.RDD$$Lambda$4717/0x0000000841702c40.apply(Unknown Source)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:621)
at org.apache.spark.executor.Executor$TaskRunner$$Lambda$2509/0x0000000840fde040.apply(Unknown Source)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
25/12/06 20:17:17 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 31.0 in stage 659.0 (TID 11146),5,main]
java.lang.OutOfMemoryError: Java heap space
at java.base/java.lang.reflect.Array.newInstance(Array.java:78)
at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:2098)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1675)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2496)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2390)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2228)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1687)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:489)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:447)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87)
at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:302)
at org.apache.spark.storage.BlockManager.maybeCacheDiskValuesInMemory(BlockManager.scala:1735)
at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:961)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:1276)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1377)
at org.apache.spark.storage.BlockManager.getOrElseUpdateRDDBlock(BlockManager.scala:1343)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:379)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:621)
at org.apache.spark.executor.Executor$TaskRunner$$Lambda$2509/0x0000000840fde040.apply(Unknown Source)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
25/12/06 20:17:17 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 26.0 in stage 659.0 (TID 11141),5,main]
java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 12.0 in stage 659.0 (TID 11127),5,main]
java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 WARN TaskSetManager: Lost task 26.0 in stage 659.0 (TID 11141) (ip-172-31-34-249.sa-east-1.compute.internal executor driver): java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 ERROR TaskSetManager: Task 26 in stage 659.0 failed 1 times; aborting job
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@5c068313 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 28, active threads = 28, queued tasks = 0, completed tasks = 11119]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@13d0dc8a rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 28, active threads = 28, queued tasks = 0, completed tasks = 11119]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@9dc0ec5 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 28, active threads = 28, queued tasks = 0, completed tasks = 11119]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@731ff9b3 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 28, active threads = 28, queued tasks = 0, completed tasks = 11119]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@69889f22 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 28, active threads = 28, queued tasks = 0, completed tasks = 11119]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@515d9541 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 28, active threads = 28, queued tasks = 0, completed tasks = 11119]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Instrumentation: org.apache.spark.SparkException: Job 367 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1(DAGScheduler.scala:1259)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1$adapted(DAGScheduler.scala:1257)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:1257)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:3129)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$stop$3(DAGScheduler.scala:3015)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1375)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:3015)
at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2258)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1375)
at org.apache.spark.SparkContext.stop(SparkContext.scala:2258)
at org.apache.spark.SparkContext.stop(SparkContext.scala:2211)
at org.apache.spark.SparkContext.$anonfun$new$34(SparkContext.scala:681)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1928)
at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:995)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2393)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2414)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2433)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2458)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1049)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:410)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1048)
at org.apache.spark.rdd.PairRDDFunctions.$anonfun$collectAsMap$1(PairRDDFunctions.scala:738)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:410)
at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:737)
at org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:663)
at org.apache.spark.ml.tree.impl.RandomForest$.runBagged(RandomForest.scala:208)
at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:302)
at org.apache.spark.ml.classification.RandomForestClassifier.$anonfun$train$1(RandomForestClassifier.scala:168)
at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:139)
at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:47)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:78)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Instrumentation: org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 659.0 failed 1 times, most recent failure: Lost task 26.0 in stage 659.0 (TID 11141) (ip-172-31-34-249.sa-east-1.compute.internal executor driver): java.lang.OutOfMemoryError: Java heap space
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2898)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2834)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2833)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2833)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1253)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1253)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1253)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3102)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3036)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3025)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:995)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2393)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2414)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2433)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2458)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1049)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:410)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1048)
at org.apache.spark.rdd.PairRDDFunctions.$anonfun$collectAsMap$1(PairRDDFunctions.scala:738)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:410)
at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:737)
at org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:663)
at org.apache.spark.ml.tree.impl.RandomForest$.runBagged(RandomForest.scala:208)
at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:302)
at org.apache.spark.ml.classification.RandomForestClassifier.$anonfun$train$1(RandomForestClassifier.scala:168)
at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:139)
at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:47)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:78)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.OutOfMemoryError: Java heap space
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@6b13b34a rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 26, active threads = 26, queued tasks = 0, completed tasks = 11121]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 WARN TaskSetManager: Lost task 13.0 in stage 659.0 (TID 11128) (ip-172-31-34-249.sa-east-1.compute.internal executor driver): TaskKilled (Stage cancelled: Job aborted due to stage failure: Task 26 in stage 659.0 failed 1 times, most recent failure: Lost task 26.0 in stage 659.0 (TID 11141) (ip-172-31-34-249.sa-east-1.compute.internal executor driver): java.lang.OutOfMemoryError: Java heap space
Driver stacktrace:)
25/12/06 20:17:17 WARN TaskSetManager: Lost task 24.0 in stage 659.0 (TID 11139) (ip-172-31-34-249.sa-east-1.compute.internal executor driver): TaskKilled (Stage cancelled: Job aborted due to stage failure: Task 26 in stage 659.0 failed 1 times, most recent failure: Lost task 26.0 in stage 659.0 (TID 11141) (ip-172-31-34-249.sa-east-1.compute.internal executor driver): java.lang.OutOfMemoryError: Java heap space
Driver stacktrace:)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@5c124c3 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 4, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 WARN TaskSetManager: Lost task 8.0 in stage 659.0 (TID 11123) (ip-172-31-34-249.sa-east-1.compute.internal executor driver): TaskKilled (Stage cancelled: Job aborted due to stage failure: Task 26 in stage 659.0 failed 1 times, most recent failure: Lost task 26.0 in stage 659.0 (TID 11141) (ip-172-31-34-249.sa-east-1.compute.internal executor driver): java.lang.OutOfMemoryError: Java heap space
Driver stacktrace:)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@4615ec86 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 WARN TaskSetManager: Lost task 6.0 in stage 659.0 (TID 11121) (ip-172-31-34-249.sa-east-1.compute.internal executor driver): TaskKilled (Stage cancelled: Job aborted due to stage failure: Task 26 in stage 659.0 failed 1 times, most recent failure: Lost task 26.0 in stage 659.0 (TID 11141) (ip-172-31-34-249.sa-east-1.compute.internal executor driver): java.lang.OutOfMemoryError: Java heap space
Driver stacktrace:)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@19b503ef rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$Lambda$6963/0x0000000841a5bc40@52ca79a8 rejected from java.util.concurrent.ThreadPoolExecutor@54d5f5ed[Shutting down, pool size = 2, active threads = 0, queued tasks = 0, completed tasks = 11125]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.scheduler.TaskResultGetter.enqueueFailedTask(TaskResultGetter.scala:139)
at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:838)
at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:811)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:71)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@7f6068b4 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$Lambda$6963/0x0000000841a5bc40@565aad10 rejected from java.util.concurrent.ThreadPoolExecutor@54d5f5ed[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 11125]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.scheduler.TaskResultGetter.enqueueFailedTask(TaskResultGetter.scala:139)
at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:838)
at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:811)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:71)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@5a9a8d76 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$Lambda$6963/0x0000000841a5bc40@48817daf rejected from java.util.concurrent.ThreadPoolExecutor@54d5f5ed[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 11125]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.scheduler.TaskResultGetter.enqueueFailedTask(TaskResultGetter.scala:139)
at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:838)
at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:811)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:71)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@52f49aa rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$Lambda$6963/0x0000000841a5bc40@28e890e1 rejected from java.util.concurrent.ThreadPoolExecutor@54d5f5ed[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 11125]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.scheduler.TaskResultGetter.enqueueFailedTask(TaskResultGetter.scala:139)
at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:838)
at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:811)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:71)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@6743e62 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$Lambda$6963/0x0000000841a5bc40@5411bc29 rejected from java.util.concurrent.ThreadPoolExecutor@54d5f5ed[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 11125]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.scheduler.TaskResultGetter.enqueueFailedTask(TaskResultGetter.scala:139)
at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:838)
at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:811)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:71)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@395bf5ca rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$Lambda$6963/0x0000000841a5bc40@2d313338 rejected from java.util.concurrent.ThreadPoolExecutor@54d5f5ed[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 11125]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.scheduler.TaskResultGetter.enqueueFailedTask(TaskResultGetter.scala:139)
at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:838)
at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:811)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:71)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@7896f707 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@4d11d55d rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@12fabfe rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@7a208397 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@ca0aba6 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@779437a9 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@5417efe6 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@55f68507 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@5182b16a rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@3bd78595 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@240eccbd rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@34becbc0 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@785cf40a rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@4623354c rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@f309595 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:17 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@53f80d15 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 11144]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
25/12/06 20:17:18 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@59c7ae31 rejected from java.util.concurrent.ThreadPoolExecutor@3c420f53[Shutting down, pool size = 2, active threads = 2, queued tasks = 0, completed tasks = 11145]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:364)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/opt/jupyterhub-venv/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3699, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "/tmp/ipykernel_240763/3648131322.py", line 64, in <module>
tvs_model_rf = tvs_rf.fit(train)
^^^^^^^^^^^^^^^^^
File "/opt/spark/python/pyspark/ml/base.py", line 205, in fit
return self._fit(dataset)
^^^^^^^^^^^^^^^^^^
File "/opt/spark/python/pyspark/ml/tuning.py", line 1464, in _fit
for j, metric, subModel in pool.imap_unordered(lambda f: f(), tasks):
File "/usr/lib/python3.12/multiprocessing/pool.py", line 873, in next
raise value
File "/usr/lib/python3.12/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/opt/spark/python/pyspark/ml/tuning.py", line 1464, in <lambda>
for j, metric, subModel in pool.imap_unordered(lambda f: f(), tasks):
^^^
File "/opt/spark/python/pyspark/util.py", line 342, in wrapped
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/spark/python/pyspark/ml/tuning.py", line 113, in singleTask
index, model = next(modelIter)
^^^^^^^^^^^^^^^
File "/opt/spark/python/pyspark/ml/base.py", line 98, in __next__
return index, self.fitSingleModel(index)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/spark/python/pyspark/ml/base.py", line 156, in fitSingleModel
return estimator.fit(dataset, paramMaps[index])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/spark/python/pyspark/ml/base.py", line 203, in fit
return self.copy(params)._fit(dataset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/spark/python/pyspark/ml/pipeline.py", line 134, in _fit
model = stage.fit(dataset)
^^^^^^^^^^^^^^^^^^
File "/opt/spark/python/pyspark/ml/base.py", line 205, in fit
return self._fit(dataset)
^^^^^^^^^^^^^^^^^^
File "/opt/spark/python/pyspark/ml/wrapper.py", line 381, in _fit
java_model = self._fit_java(dataset)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/spark/python/pyspark/ml/wrapper.py", line 378, in _fit_java
return self._java_obj.fit(dataset._jdf)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
return_value = get_return_value(
^^^^^^^^^^^^^^^^^
File "/opt/spark/python/pyspark/errors/exceptions/captured.py", line 179, in deco
return f(*a, **kw)
^^^^^^^^^^^
File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: <exception str() failed>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 516, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
response = connection.send_command(command)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 539, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 516, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
response = connection.send_command(command)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 539, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
--------------------------------------------------------------------------- Py4JJavaError Traceback (most recent call last) [... skipping hidden 1 frame] Cell In[57], line 64 63 print("Treinando Random Forest com tuning...") ---> 64 tvs_model_rf = tvs_rf.fit(train) 66 # Melhor modelo dentro do grid File /opt/spark/python/pyspark/ml/base.py:205, in Estimator.fit(self, dataset, params) 204 else: --> 205 return self._fit(dataset) 206 else: File /opt/spark/python/pyspark/ml/tuning.py:1464, in TrainValidationSplit._fit(self, dataset) 1463 metrics = [None] * numModels -> 1464 for j, metric, subModel in pool.imap_unordered(lambda f: f(), tasks): 1465 metrics[j] = metric File /usr/lib/python3.12/multiprocessing/pool.py:873, in IMapIterator.next(self, timeout) 872 return value --> 873 raise value File /usr/lib/python3.12/multiprocessing/pool.py:125, in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception) 124 try: --> 125 result = (True, func(*args, **kwds)) 126 except Exception as e: File /opt/spark/python/pyspark/ml/tuning.py:1464, in TrainValidationSplit._fit.<locals>.<lambda>(f) 1463 metrics = [None] * numModels -> 1464 for j, metric, subModel in pool.imap_unordered(lambda f: f(), tasks): 1465 metrics[j] = metric File /opt/spark/python/pyspark/util.py:342, in inheritable_thread_target.<locals>.wrapped(*args, **kwargs) 341 SparkContext._active_spark_context._jsc.sc().setLocalProperties(properties) --> 342 return f(*args, **kwargs) File /opt/spark/python/pyspark/ml/tuning.py:113, in _parallelFitTasks.<locals>.singleTask() 112 def singleTask() -> Tuple[int, float, Transformer]: --> 113 index, model = next(modelIter) 114 # TODO: duplicate evaluator to take extra params from input 115 # Note: Supporting tuning params in evaluator need update method 116 # `MetaAlgorithmReadWrite.getAllNestedStages`, make it return 117 # all nested stages and evaluators File /opt/spark/python/pyspark/ml/base.py:98, in _FitMultipleIterator.__next__(self) 97 self.counter += 1 ---> 98 return index, self.fitSingleModel(index) File /opt/spark/python/pyspark/ml/base.py:156, in Estimator.fitMultiple.<locals>.fitSingleModel(index) 155 def fitSingleModel(index: int) -> M: --> 156 return estimator.fit(dataset, paramMaps[index]) File /opt/spark/python/pyspark/ml/base.py:203, in Estimator.fit(self, dataset, params) 202 if params: --> 203 return self.copy(params)._fit(dataset) 204 else: File /opt/spark/python/pyspark/ml/pipeline.py:134, in Pipeline._fit(self, dataset) 133 else: # must be an Estimator --> 134 model = stage.fit(dataset) 135 transformers.append(model) File /opt/spark/python/pyspark/ml/base.py:205, in Estimator.fit(self, dataset, params) 204 else: --> 205 return self._fit(dataset) 206 else: File /opt/spark/python/pyspark/ml/wrapper.py:381, in JavaEstimator._fit(self, dataset) 380 def _fit(self, dataset: DataFrame) -> JM: --> 381 java_model = self._fit_java(dataset) 382 model = self._create_model(java_model) File /opt/spark/python/pyspark/ml/wrapper.py:378, in JavaEstimator._fit_java(self, dataset) 377 self._transfer_params_to_java() --> 378 return self._java_obj.fit(dataset._jdf) File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args) 1321 answer = self.gateway_client.send_command(command) -> 1322 return_value = get_return_value( 1323 answer, self.gateway_client, self.target_id, self.name) 1325 for temp_arg in temp_args: File /opt/spark/python/pyspark/errors/exceptions/captured.py:179, in capture_sql_exception.<locals>.deco(*a, **kw) 178 try: --> 179 return f(*a, **kw) 180 except Py4JJavaError as e: File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name) 325 if answer[1] == REFERENCE_TYPE: --> 326 raise Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". 328 format(target_id, ".", name), value) 329 else: <class 'str'>: (<class 'ConnectionRefusedError'>, ConnectionRefusedError(111, 'Connection refused')) During handling of the above exception, another exception occurred: ConnectionRefusedError Traceback (most recent call last) [... skipping hidden 1 frame] File /opt/jupyterhub-venv/lib/python3.12/site-packages/IPython/core/interactiveshell.py:2205, in InteractiveShell.showtraceback(self, exc_tuple, filename, tb_offset, exception_only, running_compiled_code) 2202 traceback.print_exc() 2203 return None -> 2205 self._showtraceback(etype, value, stb) 2206 if self.call_pdb: 2207 # drop into debugger 2208 self.debugger(force=True) File /opt/jupyterhub-venv/lib/python3.12/site-packages/ipykernel/zmqshell.py:642, in ZMQInteractiveShell._showtraceback(self, etype, evalue, stb) 636 sys.stdout.flush() 637 sys.stderr.flush() 639 exc_content = { 640 "traceback": stb, 641 "ename": str(etype.__name__), --> 642 "evalue": str(evalue), 643 } 645 dh = self.displayhook 646 # Send exception info over pub socket for other clients than the caller 647 # to pick up File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:471, in Py4JJavaError.__str__(self) 469 def __str__(self): 470 gateway_client = self.java_exception._gateway_client --> 471 answer = gateway_client.send_command(self.exception_cmd) 472 return_value = get_return_value(answer, gateway_client, None, None) 473 # Note: technically this should return a bytestring 'str' rather than 474 # unicodes in Python 2; however, it can return unicodes for now. 475 # See https://github.com/bartdag/py4j/issues/306 for more details. File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1036, in GatewayClient.send_command(self, command, retry, binary) 1015 def send_command(self, command, retry=True, binary=False): 1016 """Sends a command to the JVM. This method is not intended to be 1017 called directly by Py4J users. It is usually called by 1018 :class:`JavaMember` instances. (...) 1034 if `binary` is `True`. 1035 """ -> 1036 connection = self._get_connection() 1037 try: 1038 response = connection.send_command(command) File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py:284, in JavaClient._get_connection(self) 281 pass 283 if connection is None or connection.socket is None: --> 284 connection = self._create_new_connection() 285 return connection File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py:291, in JavaClient._create_new_connection(self) 287 def _create_new_connection(self): 288 connection = ClientServerConnection( 289 self.java_parameters, self.python_parameters, 290 self.gateway_property, self) --> 291 connection.connect_to_java_server() 292 self.set_thread_connection(connection) 293 return connection File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py:438, in ClientServerConnection.connect_to_java_server(self) 435 if self.ssl_context: 436 self.socket = self.ssl_context.wrap_socket( 437 self.socket, server_hostname=self.java_address) --> 438 self.socket.connect((self.java_address, self.java_port)) 439 self.stream = self.socket.makefile("rb") 440 self.is_connected = True ConnectionRefusedError: [Errno 111] Connection refused
--------------------------------------------------------------------------- ConnectionRefusedError Traceback (most recent call last) File /opt/jupyterhub-venv/lib/python3.12/site-packages/IPython/core/async_helpers.py:128, in _pseudo_sync_runner(coro) 120 """ 121 A runner that does not really allow async execution, and just advance the coroutine. 122 (...) 125 Credit to Nathaniel Smith 126 """ 127 try: --> 128 coro.send(None) 129 except StopIteration as exc: 130 return exc.value File /opt/jupyterhub-venv/lib/python3.12/site-packages/IPython/core/interactiveshell.py:3413, in InteractiveShell.run_cell_async(self, raw_cell, store_history, silent, shell_futures, transformed_cell, preprocessing_exc_tuple, cell_id) 3409 exec_count = self.execution_count 3410 if result.error_in_exec: 3411 # Store formatted traceback and error details 3412 self.history_manager.exceptions[exec_count] = ( -> 3413 self._format_exception_for_storage(result.error_in_exec) 3414 ) 3416 # Each cell is a *single* input, regardless of how many lines it has 3417 self.execution_count += 1 File /opt/jupyterhub-venv/lib/python3.12/site-packages/IPython/core/interactiveshell.py:3474, in InteractiveShell._format_exception_for_storage(self, exception, filename, running_compiled_code) 3470 except Exception: 3471 # In case formatting fails, fallback to Python's built-in formatting. 3472 stb = traceback.format_exception(etype, evalue, tb) -> 3474 return {"ename": etype.__name__, "evalue": str(evalue), "traceback": stb} File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:471, in Py4JJavaError.__str__(self) 469 def __str__(self): 470 gateway_client = self.java_exception._gateway_client --> 471 answer = gateway_client.send_command(self.exception_cmd) 472 return_value = get_return_value(answer, gateway_client, None, None) 473 # Note: technically this should return a bytestring 'str' rather than 474 # unicodes in Python 2; however, it can return unicodes for now. 475 # See https://github.com/bartdag/py4j/issues/306 for more details. File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1036, in GatewayClient.send_command(self, command, retry, binary) 1015 def send_command(self, command, retry=True, binary=False): 1016 """Sends a command to the JVM. This method is not intended to be 1017 called directly by Py4J users. It is usually called by 1018 :class:`JavaMember` instances. (...) 1034 if `binary` is `True`. 1035 """ -> 1036 connection = self._get_connection() 1037 try: 1038 response = connection.send_command(command) File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py:284, in JavaClient._get_connection(self) 281 pass 283 if connection is None or connection.socket is None: --> 284 connection = self._create_new_connection() 285 return connection File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py:291, in JavaClient._create_new_connection(self) 287 def _create_new_connection(self): 288 connection = ClientServerConnection( 289 self.java_parameters, self.python_parameters, 290 self.gateway_property, self) --> 291 connection.connect_to_java_server() 292 self.set_thread_connection(connection) 293 return connection File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py:438, in ClientServerConnection.connect_to_java_server(self) 435 if self.ssl_context: 436 self.socket = self.ssl_context.wrap_socket( 437 self.socket, server_hostname=self.java_address) --> 438 self.socket.connect((self.java_address, self.java_port)) 439 self.stream = self.socket.makefile("rb") 440 self.is_connected = True ConnectionRefusedError: [Errno 111] Connection refused
from pyspark.ml.classification import GBTClassifier
# Modelo
gbt = GBTClassifier(
labelCol="label",
featuresCol="features",
maxIter=30, # número de árvores
maxDepth=5, # profundidade limitada no Spark
stepSize=0.1 # learning rate
)
# Pipeline
pipeline_gbt = Pipeline(stages=indexers + encoders + [assembler, gbt])
print("Treinando modelo GBT...")
gbt_model = pipeline_gbt.fit(train)
# Predição
pred_gbt = gbt_model.transform(test)
# Avaliação
evaluator_auc = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction")
evaluator_f1 = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
evaluator_acc = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
auc_gbt = evaluator_auc.evaluate(pred_gbt)
f1_gbt = evaluator_f1.evaluate(pred_gbt)
acc_gbt = evaluator_acc.evaluate(pred_gbt)
print("=== GBT RESULTS ===")
print("AUC: ", auc_gbt)
print("F1 score:", f1_gbt)
print("Accuracy:", acc_gbt)
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[56], line 13 4 gbt = GBTClassifier( 5 labelCol="label", 6 featuresCol="features", (...) 9 stepSize=0.1 # learning rate 10 ) 12 # Pipeline ---> 13 pipeline_gbt = Pipeline(stages=indexers + encoders + [assembler, gbt]) 15 print("Treinando modelo GBT...") 16 gbt_model = pipeline_gbt.fit(train) NameError: name 'indexers' is not defined
from pyspark.sql import functions as F
# usar df_small ou df_model (escolhe um dos dois)
base = df_small
area_features = (
base
.groupBy("community_area")
.agg(
F.count("*").alias("total_crimes"),
F.avg("label").alias("arrest_rate"),
F.avg(F.when(F.col("domestic") == "true", 1).otherwise(0)).alias("domestic_rate"),
F.avg("hour").alias("avg_hour")
)
)
area_features.show(10, truncate=False)
print("Nº de bairros agregados:", area_features.count())
+--------------+------------+-------------------+--------------------+------------------+ |community_area|total_crimes|arrest_rate |domestic_rate |avg_hour | +--------------+------------+-------------------+--------------------+------------------+ |31 |3496 |0.2640160183066362 |0.10411899313501144 |13.031178489702517| |65 |2681 |0.21484520701230883|0.11301753077209996 |13.332711674748229| |53 |5775 |0.23965367965367965|0.1986147186147186 |13.297316017316017| |34 |1311 |0.2288329519450801 |0.08009153318077804 |13.273073989321128| |28 |10610 |0.24052780395852968|0.07935909519321395 |12.90961357210179 | |76 |1993 |0.2222779729051681 |0.061214249874560964|12.08078273958856 | |27 |6650 |0.3924812030075188 |0.16421052631578947 |13.256691729323308| |26 |6693 |0.43373673987748396|0.15598386373823397 |13.538024802031973| |44 |7689 |0.2562101703732605 |0.18012745480556638 |13.040057224606581| |12 |643 |0.09486780715396578|0.08709175738724728 |13.211508553654744| +--------------+------------+-------------------+--------------------+------------------+ only showing top 10 rows
[Stage 182:==========> (6 + 26) / 32]
Nº de bairros agregados: 78
from pyspark.sql.functions import log1p
area_features = area_features.withColumn(
"log_total_crimes",
log1p(F.col("total_crimes"))
)
from pyspark.ml.feature import VectorAssembler, StandardScaler
# escolher quais colunas entram no cluster
feature_cols = ["log_total_crimes", "arrest_rate", "domestic_rate", "avg_hour"]
assembler = VectorAssembler(
inputCols=feature_cols,
outputCol="features_raw"
)
assembled = assembler.transform(area_features)
scaler = StandardScaler(
inputCol="features_raw",
outputCol="features",
withMean=True,
withStd=True
)
scaled_data = scaler.fit(assembled).transform(assembled)
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
evaluator = ClusteringEvaluator(
featuresCol="features",
predictionCol="prediction",
metricName="silhouette"
)
for k in [2, 3, 4, 5, 6]:
kmeans = KMeans(k=k, seed=42, featuresCol="features")
model_k = kmeans.fit(scaled_data)
preds_k = model_k.transform(scaled_data)
sil = evaluator.evaluate(preds_k)
print(f"k = {k} -> Silhouette = {sil:.4f}")
25/12/06 21:24:39 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
k = 2 -> Silhouette = 0.3591
k = 3 -> Silhouette = 0.2230
k = 4 -> Silhouette = 0.4233
k = 5 -> Silhouette = 0.3192
[Stage 490:=====================================================> (31 + 1) / 32]
k = 6 -> Silhouette = 0.3374
from pyspark.sql.functions import col
# supondo que este DF seja o resultado por área com o cluster
# (adapte o nome para o que você usou)
clusters_area_sem_outlier = area_clusters.filter(col("community_area") != 0)
clusters_area_sem_outlier.groupBy("prediction").agg(
F.count("*").alias("num_bairros"),
F.avg("total_crimes").alias("media_crimes"),
F.avg("arrest_rate").alias("media_arrest_rate"),
F.avg("domestic_rate").alias("media_domestic_rate"),
F.avg("avg_hour").alias("media_avg_hour")
).orderBy("prediction").show(truncate=False)
[Stage 551:================================> (19 + 13) / 32]
+----------+-----------+------------------+-------------------+-------------------+------------------+ |prediction|num_bairros|media_crimes |media_arrest_rate |media_domestic_rate|media_avg_hour | +----------+-----------+------------------+-------------------+-------------------+------------------+ |0 |36 |2080.1944444444443|0.18900757828887876|0.13223657077622575|13.072213710432358| |1 |30 |7234.033333333334 |0.2911635801673923 |0.1542232692542025 |13.276360871827823| |2 |11 |5625.545454545455 |0.20001915870740763|0.0860961990977164 |12.627060485980657| +----------+-----------+------------------+-------------------+-------------------+------------------+
from pyspark.sql.functions import col, count, avg
crimes_area = crimes_df2.groupBy("community_area").agg(
count("*").alias("total_crimes"),
avg(col("arrest").cast("int")).alias("arrest_rate"),
avg(col("domestic").cast("int")).alias("domestic_rate"),
avg(col("hour")).alias("avg_hour")
)
crimes_area.show(5)
[Stage 606:==========================================> (25 + 7) / 32]
+--------------+------------+-------------------+-------------------+------------------+ |community_area|total_crimes| arrest_rate| domestic_rate| avg_hour| +--------------+------------+-------------------+-------------------+------------------+ | 31| 70496|0.25990127099409893|0.10858772128915116|13.016454834316841| | 65| 52703|0.22070849856744398|0.11919625068781663|13.415308426465286| | 53| 117229|0.24314802651221115| 0.1989951291915823|13.349981659828199| | 34| 27148|0.22642551937527627|0.07499631648740239| 13.48276116104317| | 28| 217006| 0.2418274149101868|0.08500225800208289|13.000184326700644| +--------------+------------+-------------------+-------------------+------------------+ only showing top 5 rows
from pyspark.sql.functions import col
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler, StandardScaler
# ----------------------------------------------------------
# 1. REMOVER OUTLIER (community_area = 0)
# ----------------------------------------------------------
data_no_outlier = crimes_area.filter(col("community_area") != 0)
# ----------------------------------------------------------
# 2. MONTAR FEATURES PARA O K-MEANS
# ----------------------------------------------------------
assembler = VectorAssembler(
inputCols=["total_crimes", "arrest_rate", "domestic_rate", "avg_hour"],
outputCol="features_raw"
)
assembled = assembler.transform(data_no_outlier)
# Padronização (melhora muuuuito o clustering)
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
scaled_data = scaler.fit(assembled).transform(assembled)
# ----------------------------------------------------------
# 3. RODAR O K-MEANS JÁ SEM O OUTLIER
# ----------------------------------------------------------
best_k = 4 # depois de analisar a silhouette
kmeans_final = KMeans(k=best_k, seed=42, featuresCol="features")
kmeans_model = kmeans_final.fit(scaled_data)
area_clusters = kmeans_model.transform(scaled_data)
[Stage 659:=============> (8 + 24) / 32]
+--------------+------------+-------------------+--------------------+------------------+----------+ |community_area|total_crimes|arrest_rate |domestic_rate |avg_hour |prediction| +--------------+------------+-------------------+--------------------+------------------+----------+ |4 |51280 |0.17593603744149766|0.07868564742589704 |12.963845553822154|0 | |5 |42353 |0.17873586286685714|0.05333742592024178 |13.041626331074541|0 | |6 |144756 |0.1797922020503468 |0.04656801790599353 |12.32049103318688 |0 | |7 |111392 |0.13280127836828498|0.0329018241884516 |12.847960356219478|0 | |8 |252839 |0.25234635479494855|0.044589640047619235|12.661353667749042|0 | |22 |148347 |0.19590554578117522|0.11105718349545322 |12.968553459119496|0 | |24 |210238 |0.1772657654658054 |0.07992370551470239 |12.782203978348347|0 | |28 |217006 |0.2418274149101868 |0.08500225800208289 |13.000184326700644|0 | |56 |59169 |0.23860467474522132|0.10801264175497305 |12.884804542919435|0 | |57 |25800 |0.18294573643410852|0.10484496124031008 |12.698527131782946|0 | |76 |43640 |0.2442942254812099 |0.05524747937671861 |12.206118240146655|0 | |77 |71823 |0.21203514194617323|0.08792448101583059 |13.001308772955738|0 | |9 |7128 |0.11307519640852974|0.15404040404040403 |12.687149270482603|1 | |10 |31280 |0.13945012787723784|0.11975703324808185 |12.727557544757033|1 | |11 |28643 |0.16195929197360612|0.1388122752504975 |12.97950633662675 |1 | |12 |13342 |0.10433218408034778|0.10283315844700944 |12.995952630789986|1 | |14 |64187 |0.22630750775079067|0.1353233520806394 |13.109975540218425|1 | |16 |80918 |0.18046664524580439|0.12904421760300552 |12.873847598803728|1 | |17 |44302 |0.15493657171233804|0.1466750936752291 |13.118391946187531|1 | |18 |17077 |0.14212098143702057|0.1661298822978275 |13.029571938865141|1 | |19 |131404 |0.23468844175215367|0.15241545158442665 |13.010905299686463|1 | |20 |43084 |0.2519728901680438 |0.1700631324853774 |13.156601058397548|1 | |21 |66413 |0.1961664132022345 |0.1253670215168717 |13.01346121994188 |1 | |36 |16513 |0.17961606007388117|0.21255980136861866 |13.228123296796463|1 | |39 |41525 |0.17016255267910896|0.1717278747742324 |13.029163154726069|1 | |41 |46175 |0.14163508391987006|0.10388738494856524 |13.005024363833243|1 | |45 |36736 |0.18423344947735193|0.13937282229965156 |12.764890026132404|1 | |47 |10731 |0.20669089553629671|0.16335849408256453 |12.98154878389712 |1 | |48 |39197 |0.17391637115085337|0.14077607980202567 |12.780825063142588|1 | |50 |29002 |0.22712226742983244|0.16988483552858424 |13.028135990621337|1 | |51 |47279 |0.21641743691702447|0.1557351043803803 |13.12083588908395 |1 | |52 |35408 |0.20210122006326253|0.1478197017623136 |13.036517171260732|1 | |54 |32396 |0.22329917273737498|0.23515248796147672 |13.228299790097543|1 | |55 |15840 |0.17241161616161615|0.15132575757575759 |12.999810606060606|1 | |58 |69144 |0.2457769293069536 |0.13885514289020015 |12.953531759805623|1 | |60 |45725 |0.18508474576271186|0.11737561509021323 |12.926867140513941|1 | |62 |27501 |0.1845023817315734 |0.11886840478528053 |12.884549652739901|1 | |63 |65274 |0.25089622207923523|0.13594999540398933 |12.892529950669486|1 | |64 |28527 |0.15553685981701545|0.16692957548988677 |12.756581484207944|1 | |70 |65081 |0.15626680598024 |0.14776970237089165 |12.950753676188135|1 | |72 |26039 |0.12861477015246361|0.1197434617304812 |12.838434655708745|1 | |73 |85436 |0.21907626761552507|0.17461023456154315 |13.079849243878458|1 | |74 |16132 |0.15521943962310936|0.1462310934787999 |12.670654599553682|1 | |75 |57187 |0.21347509049259447|0.1681675905363107 |13.19233392204522 |1 | |23 |223982 |0.3750524595726442 |0.15371324481431545 |13.323400987579358|2 | |25 |448276 |0.3751773460992781 |0.16956294782678527 |13.273121915962488|2 | |26 |135522 |0.4350363778574696 |0.1521819335606027 |13.483353256297871|2 | |27 |134276 |0.3833224105573595 |0.15910512675385027 |13.264001012839227|2 | |29 |209901 |0.3711559258888714 |0.174210699329684 |13.253867299345883|2 | |40 |75810 |0.2740535549399815 |0.17120432660598867 |13.181651497163962|2 | |42 |114737 |0.2926344596773491 |0.1883001995868813 |13.422339785770937|2 | |43 |236555 |0.22633214263067786|0.2010652913698717 |13.050609794762318|2 | |44 |158031 |0.24979909005195183|0.1829767577247502 |13.091931329928938|2 | |46 |132217 |0.24912076359318394|0.161484529220902 |13.112890172973218|2 | |49 |190600 |0.2614323189926548 |0.17921301154249739 |13.193216159496327|2 | |53 |117229 |0.24314802651221115|0.1989951291915823 |13.349981659828199|2 | |61 |144352 |0.32352859676346707|0.14177150299268454 |13.353178341831079|2 | |66 |174517 |0.26446707197579605|0.17548433676948377 |13.210758837247948|2 | |67 |205117 |0.2889131568811946 |0.19373820794960925 |13.28728969319949 |2 | |68 |187126 |0.27648750040079945|0.20367025426717827 |13.088630120881117|2 | |69 |178267 |0.2581913646384356 |0.20433955807861243 |12.960901344612296|2 | |71 |203100 |0.25450024618414574|0.19845396356474643 |13.033929098966027|2 | |1 |110627 |0.26122013613313205|0.11367026132860875 |13.281522593941805|3 | |2 |91678 |0.16165274111564387|0.10712493728048168 |13.323861777089379|3 | |3 |105063 |0.2931288845740175 |0.08818518412761867 |13.421880205210208|3 | |13 |24157 |0.17059237488098689|0.09260255826468518 |13.265016351368134|3 | |15 |90729 |0.21881647543784236|0.14296421210417837 |13.258032161712352|3 | |30 |120464 |0.2744886439102138 |0.13746015407092577 |13.259330588391553|3 | |31 |70496 |0.25990127099409893|0.10858772128915116 |13.016454834316841|3 | |32 |177732 |0.24362523349762563|0.03131118763081493 |13.406685346476717|3 | |33 |55471 |0.30428512195561647|0.07829316219285753 |13.159398604676317|3 | |34 |27148 |0.22642551937527627|0.07499631648740239 |13.48276116104317 |3 | |35 |79828 |0.34574334819862707|0.10941023199879742 |13.2325625093952 |3 | |37 |23785 |0.34559596384275804|0.11242379651040572 |13.376371662812698|3 | |38 |99216 |0.2652596355426544 |0.13675213675213677 |13.21734397677794 |3 | |59 |29131 |0.23397754968933437|0.10689643335278569 |13.20273934983351 |3 | |65 |52703 |0.22070849856744398|0.11919625068781663 |13.415308426465286|3 | +--------------+------------+-------------------+--------------------+------------------+----------+
from pyspark.sql.functions import countDistinct, avg, round as Fround
# Resumo dos clusters
clusters_resumo = (
area_clusters
.groupBy("prediction")
.agg(
countDistinct("community_area").alias("num_bairros"),
Fround(avg("total_crimes"), 2).alias("media_crimes"),
Fround(avg("arrest_rate"), 3).alias("media_arrest_rate"),
Fround(avg("domestic_rate"), 3).alias("media_domestic_rate"),
Fround(avg("avg_hour"), 2).alias("media_avg_hour")
)
.orderBy("prediction")
)
clusters_resumo.show(truncate=False)
[Stage 665:=========================> (15 + 17) / 32]
+----------+-----------+------------+-----------------+-------------------+--------------+ |prediction|num_bairros|media_crimes|media_arrest_rate|media_domestic_rate|media_avg_hour| +----------+-----------+------------+-----------------+-------------------+--------------+ |0 |12 |114886.92 |0.201 |0.074 |12.78 | |1 |32 |42644.56 |0.184 |0.149 |12.97 | |2 |18 |181645.28 |0.3 |0.178 |13.22 | |3 |15 |77215.2 |0.255 |0.104 |13.29 | +----------+-----------+------------+-----------------+-------------------+--------------+
# Garantir que não tem community_area = 0 (outlier já removido no KMeans)
# Se o area_clusters já foi gerado a partir do data_no_outlier, nem precisa disso
area_clusters_clean = area_clusters.filter("community_area != 0")
area_clusters_pd = (
area_clusters_clean
.select("community_area", "prediction")
.dropDuplicates() # um registro por área
.toPandas()
)
# Fazer o join pelo código da community_area
gdf_clusters = gdf.merge(
area_clusters_pd,
on="community_area",
how="inner" # só áreas que têm cluster
)
gdf_clusters[["community_area", "prediction"]].head()
| community_area | prediction | |
|---|---|---|
| 0 | 26.0 | 2 |
| 1 | 27.0 | 2 |
| 2 | 25.0 | 2 |
| 3 | 23.0 | 2 |
| 4 | 29.0 | 2 |
fig, ax = plt.subplots(figsize=(8, 8))
gdf_clusters.plot(
column="prediction",
ax=ax,
categorical=True, # clusters são categorias
legend=True,
edgecolor="black",
linewidth=0.6
# você pode passar um cmap se quiser, ex.: cmap="tab10"
)
ax.set_title("Clusters de Community Areas por perfil de crimes e prisões", fontsize=14)
ax.set_axis_off()
plt.tight_layout()
plt.savefig("chicago_clusters_crime_map.png", dpi=150, bbox_inches="tight")
plt.show()